Friday May 18, 2012

Be ware of benchmarks in lab jackets

So I recently ran into the white paper:

Using Paxos to Build a Scalable, Consistent,
      and Highly Available Datastore

http://t.co/51qNTjpe

Generally speaking when I hear about a scalable, highly available, data store I prepare myself to see some outlandish claims. None of that happened here, but there was one thing that caught my eye.  


Something is rotten in Denmark. First, without any math or benchmarking these lines should not be 20MS apart.  Your issuing a single write to a random Cassandra node (a coordinator) which launches N parallel writes to the nodes responsible for that data. Writes in Cassandra are memory bound and network bound. You could get some additional latency on QUORUM possibly 1-5ms, but 20 is highly suspect.

But other then a discrepency between 20ms and 2ms there is another problem, the problem that caught my eye in the first place, the lines should look like this:


The reason for this is that regardless or writing at quorum or one or all, in the end the write will happen on N nodes. At saturation the write stage would back up and the WEAK write performance would converge with the QUORUM write performance. If the two lines are not converging then the benchmark must not be fully loading Cassandra. (then what is it benchmarking) 

NoSQL should really have an an established NoSortium like TPH-C, that can run benchmarks on common hardware and verify results. 


Saturday May 12, 2012

On Key Value vs Column Family

The one line, mile-high view of Apache Cassandra is as follows:

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together a fully distributed design and a ColumnFamily-based data model.

I have worked with the project for some time now, and I was doing some reflection on this statement and how profound Cassandra's design is. Amazon Dynamo's data model is a key value store, while Cassandra's data model is a ColumnFamily-based one. You can 'encapsulate' the difference like this:

Key value:
Map<Object,Object>

Column Family
Map<Object, Map<Object,Object>>
There are many successful tools that are key value stores, notably Memcache. Some NoSQL products take the next step and turn key value store into a distributed key value store. The ColumnFamily model is more complex with its larger signature  Map<Object,Object>, but that provides greater flexibility and allows certain applications to be modelled that would be difficult to do with a pure key/value store.

For example, imagine your are collecting page visits for a users web session. User bob visits 4 pages, index.html, stuff.html, purchase.html, and checkout.html.

In a pure key-value system this requirement is difficult to model. The first stumbling block is you can not append in a pure key/value store, so you will have to read, append, and then write back.
List<Object> newResults = get visit[bob]
results.add( $time-index.html,$time-stuff.html,$time-purchase.html,$time-checkout.html )
set visit[bob]= newResults
The above pseudo code only works in a single threaded writer otherwise you could face lost update problems. 
 
One way the key value system could handle this is:
server.lock( bob )
List<Object> newResults = get visit[bob]
results.add( $time-index.html , $time-stuff.html, $time-purchase.html,$time-checkout.html ) <--client side stuff
set visit[bob]= newResults
server.unlock ( bob )

This has many drawbacks. The first is you are going to need 3-4 client server exchanges to complete this, and the second is handling the situation where a client vanishes after acquiring the lock is a problem along with all the other distributed locking problems.

Vector Clocks are another system to handle this type of problem without the multiple round-trips. Vector clock systems are cool but the explanation/proof of how it works involves a bit more math then I am used to.

Let's take a look at how the ColumnFamily based model solves this problem:

set visits[bob][$time-index.html] =''
set visits[bob][$time-stuff.html] =''
set visits[bob][$time-purchase.html] =''
set visits[bob][$time-checkout.htm] = ''

Cassandra's use of ColumnFamily are a straight forward solution when compared to Vector Clocks, locking, or transactions. The application is responsible for creating a unique key for the map. In this example, we appended a timeUUID infront of the page name. This design does not need to "read before write" to accomplish an "append", nor does it need vector clocks on the server side to reconsile the data internally.

The structured log format using Memtables and SSTable allows for fast write and a system can merge on read. This can be a huge optimization for many applications. For example, at m6d we have our mapping servers writing URL information into something like the visit table I described above. If we had to 'read before write' or 'select before insert' we would need significantly more hardware, (if I had to guess I would say 2-3 times). Likewise we can add (or remove) Segments information to or from a user without ever having to issue a read.

Also for we could just use Cassandra as a key value store if we had no need for the map. In the example  below 'k' is a static char is the only column in the column family.

set Visists['bob']['k']='index.html,page2.html,page3'

This was a contrived example, but many applications can be fit well into the ColumnFamily model. Time series is one such example, lists inside a key, sets inside a key, different structure types under the same key. The key being a map rather then just a single opaque entity is very powerful and practical. The tunable consistency model, distributed design, and column family based approach are an optimal tool for many applications.

Saturday May 05, 2012

Rank with hive

So I wanted to do a few things.

1) Promote a cool open source edition to hive by m6d (Rank UDF)
2) Promote the upcoming "Programming Hive" book (that I am co-authoring)

What better way then to give a preview of the m6d rank case study in the programming hive book?

--M6d UDF Pseudo Rank--

by David Ha and Rumit Patel

Sorting data and identifying the top n elements is straightforward. You order the whole data set by some criteria and limit the result set to n. But there are times when you need to group like elements together and find the top n elements within that group only. For example, identifying the top ten requested songs for each recording artist or the top 100 best selling items per product category and country. Several database platforms define a rank() function that can support these scenarios, but until Hive provides an implementation, we can create a user-defined function to produce the results we want. We will call this function p_rank() for psuedo-rank, leaving the name rank() for the Hive implementation.

Say we have the following product sales data and we want to see the top three items per category and country:

|===============================
|category       |country        |product        |sales
|movies |us     |chewblanca     |100
|movies |us     |war stars iv   |150
|movies |us     |war stars iii  |200
|movies |us     |star wreck     |300
|movies |gb     |titanus        |100
|movies |gb     |spiderella     |150
|movies |gb     |war stars iii  |200
|movies |gb     |war stars iv   |300
|================================

In most SQL systems :
[source,sql]
--------------------------
SELECT
 category,country,product,sales,rank
FROM (
 SELECT
   category,country,product, sales,
   rank() over (PARTITION BY category, country ORDER BY sales DESC) rank
 FROM p_rank_demo) t
WHERE rank <= 3
-----------------------------

To achieve the same result using HiveQL, the first step is partitioning the data into groups, which we can achieve using the distribute by clause. We must ensure that all rows with the same category and country are sent to the same reducer.

[source,sql]
--------------------------
DISTRIBUTE BY
 category,
 country
-------------------------

The next step is ordering the data in each group by descending sales using the sort by clause. While order by effects a total ordering across all data, sort by affects the ordering of data on a specific reducer. You must repeat the partition columns named in the distribute by clause.

[source,sql]
--------------------------
SORT BY
 category,
 country,
 sales DESC
--------------------------
Putting everything together, we have:

[source,sql]
--------------------------
ADD JAR p-rank-demo.jar;
CREATE TEMPORARY FUNCTION p_rank AS 'demo.PsuedoRank';

SELECT
 category,country,product,sales,rank
FROM (
 SELECT
   category,country,product,sales,
   p_rank(category, country) rank
 FROM (
   SELECT
     category,country,product,
     sales
   FROM p_rank_demo
   DISTRIBUTE BY
     category,country
   SORT BY
     category,country,sales desc) t1) t2
WHERE rank <= 3
----------------------------------

There is a little more to the case study but I have to save something for the book... The code is here:

https://github.com/edwardcapriolo/hive-rank

Enjoy!


Friday May 04, 2012

Hadoop is the best thing since sliced bread

Stonebreaker is at it again:
http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories/fulltext

I seem to remember the last "big news" Stonebreaker had was another facebook criticism.

http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-death/

There is some irony here that facebook is about to go IPO for what 10 billion. And all those "five parallel DBMSs" are probably not recognizable names to the average tech person.

Yea! I said it, I really did not want to nasty but whatever, your picking on hive "that's my brother man".

First off, I do not even understand what this article is trying to get at. We know that hadoop is not MPI, we know that hadoop and hive is not a true parallel database, and we know that you can not fit every problem into hadoop or hive. Just like you can not fit every problem into a relational database even a parallel one.

Last, I checked no one goes around saying "Hadoop is the best thing since sliced bread" in fact the only one who goes around acting like his shit don't stink is Stonebreaker himself.

Let me tell you a story of how I got into hadoop and hive. I was following advice like Stonebreaker's that said Parallel DBs are the way to go. But I quickly found out Parallel Database are too rich for my blood.  Now, I am not telling you or anyone else that you should not spend money on Parallel DBs, because maybe you have the money, or maybe you need some of those things the parallel database provides. But for things I need to do:

  • store tons of data
  • processed it reasonably fast
  • be LOW on the cost scale

Hadoop and hive work fine for me.

Stonebreaker:, "Most Hadoop sites are somewhere between steps 2 and 3, and “hitting the wall” is still in their future"

This remind me of a quote between the infamous rap battle between LL Cool J and Cannabis. 

Cannabis: "99% of your fans wear high heals."
LL Cool J: "99% of your fans don't exist."

VoltDB, need I say more.

Most Hadoop sites NEVER WANTED A PARALLEL DATABASE IN THE FIRST PLACE. So there is no wall to hit. Were not as smart as you, but we are not fucking idiots. We did not use NoSQL because we want a parallel database and we did not use Hadoop and hive because we wanted one either.

Hive is only a declarative language that makes working with hadoop look like querying a database. It does a good job paralleling some relational problems on top of hadoop.

Also an irony of the article is the mention of the hype-cycle. This would be a fair criticism of Hadoop and Big Data that we are in hype-cycle. However when your "go-to move" for promoting yourself and your products is to fling fud at others you do not have a leg to stand on.

Also what is with having a picture every time you have an article. Who does that?


Wednesday May 02, 2012

My hate for static

Let me kick this entry off my saying, "take all my programming advice with a grain of salt". I only get to spend a fraction of my time programmer, and at times my opinions are stronger then my skills.

My Advice: DON'T USE STATIC FOR ANYTHING!

So many times i have been screwed by static. My latest example is 'import static'. At first import static seems really cool. Who wants to keep typing Assert.assertEquals() after all? So... you start using import static for this and for that then just about everything.

Now, today I was trying to hack up hector so it works with Cassandra 1.1. https://github.com/edwardcapriolo/hector. I hit a bug in a unit test and now I am trying to trace it down.

   TestCleanupDescriptor cleanup = insertColumns(cf, 4,
"testRangeSlicesQuery", 3, "testRangeSlicesQueryColumn");    

Wait? Why is insertColumns() not found anywhere in this file? Ah duh it's a static import. Old school guys like myself forget that now you can do that. Well that is not so bad let me head up to the top of the file to find what class this bad boy lives in....

import static me.prettyprint.hector.api.factory.HFactory.createColumn;
import static me.prettyprint.hector.api.factory.HFactory.createColumnQuery;
import static me.prettyprint.hector.api.factory.HFactory.createCountQuery;
import static me.prettyprint.hector.api.factory.HFactory.createCounterColumn;
import static me.prettyprint.hector.api.factory.HFactory.createCounterColumnQuery;
import static me.prettyprint.hector.api.factory.HFactory.createKeyspace;
import static me.prettyprint.hector.api.factory.HFactory.createMultigetSliceQuery;
import static me.prettyprint.hector.api.factory.HFactory.createMultigetSubSliceQuery;
import static me.prettyprint.hector.api.factory.HFactory.createMultigetSuperSliceQuery;
import static me.prettyprint.hector.api.factory.HFactory.createMutator;
import static me.prettyprint.hector.api.factory.HFactory.createRangeSlicesQuery;
import static me.prettyprint.hector.api.factory.HFactory.createRangeSubSlicesQuery;
import static me.prettyprint.hector.api.factory.HFactory.createRangeSubSlicesCounterQuery;
import static me.prettyprint.hector.api.factory.HFactory.createRangeSuperSlicesQuery;
import static me.prettyprint.hector.api.factory.HFactory.createSliceQuery;
import static me.prettyprint.hector.api.factory.HFactory.createCounterSliceQuery;
import static me.prettyprint.hector.api.factory.HFactory.createCounterSuperColumn;
import static me.prettyprint.hector.api.factory.HFactory.createSubColumnQuery;
import static me.prettyprint.hector.api.factory.HFactory.createSubCountQuery;
import static me.prettyprint.hector.api.factory.HFactory.createSubSliceQuery;
import static me.prettyprint.hector.api.factory.HFactory.createSubSliceCounterQuery;
import static me.prettyprint.hector.api.factory.HFactory.createSuperColumn;
import static me.prettyprint.hector.api.factory.HFactory.createSuperColumnQuery;
import static me.prettyprint.hector.api.factory.HFactory.createSuperCountQuery;
import static me.prettyprint.hector.api.factory.HFactory.createSuperSliceQuery;
import static me.prettyprint.hector.api.factory.HFactory.getOrCreateCluster;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;

Damn. Well now I can't find where the method is. Now I know what you are going to say and your right.

"Your IDE probably has some keystroke to goto definition, or you can just search your code."

Well your right! The problem is I do not know them! Again this is a reflection on my poor programming skills, however. its also a reflection on something deeper.

One of Java's benefits is packaging. At times people bitch about like com.myco.whatever, but its purpose makes sense. better com.myco.Thing.insertColumns() then having a bunch of global method and not knowing which package they come from c/c++ for example. Most importantly to me at the moment, "Which class actually owns insertColumns() ? :)"

So yes. import static. From a consensus standpoint, I believe you should only use it for a method that the majority of people know exactly what you are talking about.

assertEquals() good. junit is a common package almost all are familiar with. Save some typing.

 

insertColumns() bad. No one outside the code base would be familiar with this. Put the name of the class here.

This is no reflection on the hector code base quality, again do not use my arguments on a programming interview because you might not get a job. But remember there are folks that sometimes have to hack code bases in vi and stuff and not everyone is an IDE keystroke superstar.

Static has also screwed me in various open source projects as well, mostly not being able to extend or change a feature, or thread local problems. I would like to get into that more but I have work to do. I really avoid static in every way.

Monday Apr 30, 2012

hive: SELECT without FROM kinda

I know many of you DBA's do not even use windows calculator. After all you can just call

SELECT 1+2;

But hive you could not do that, you always had to do:

SELECT 1+2 FROM <sometable that exists> LIMIT 1;

What a PITA!

Well worry not! Your troubles are over. After thousands of man hours the most brilliant guy who writes this blog has a a solution!

https://github.com/edwardcapriolo/DualInputFormat

Yes! an input format that returns 1 split and 1 row. Amazing! No matter how many files or rows you in your path you are guaranteed to get 1 and only 1 return row! No more limit clauses.

client.execute("create table  dual  (fake string) "+
"STORED AS INPUTFORMAT 'com.m6d.dualinputformat.DualInputFormat'"+
"OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'");
client.execute("load data local inpath '" + p.toString() + "' into table dual");
client.execute("select count(1) as cnt from dual");
String row = client.fetchOne();
assertEquals( "1", row);

Well that just says it all. Don't it?

Sunday Apr 29, 2012

Hive and Oozie Featuring "Hive Plan B" Action

The last couple of blog posts have been rant like, and really did not mean to get onto tracks like that. I am not trying tell anyone how they should do anything I just want to make a good case for #groundcomputing. Back to something technical people can use.

Oozie has had a HiveAction for a while now but it has some caveats that make it less then attractive to me:

  1. You have to bundle your hive-site.xml with your Hive job
  2. It interacts with hive by layering on top the CLI driver
  3. Hive jobs run on random task trackers
    1. Hive needs to connect to your metastore to run
    2. Hive leaves junk around /tmp/hive-$user etc. etc.

The largest deal breaker is actually item two. In terms of API interacting with the thrift service is cleaner then launching a CLI and trying to program over it. A few posts ago we covered the set up of a redundant hive-thrift server

Well wouldn't it be great if an oozie action would talk to hive service instead  of layering on top of the hive CLI? Enter the HiveService "Plan B" Action. Technically, it is a not an Action it is a program designed to work with java-main-action, but that is just splitting hairs.

To use it :

Make a directory for your project:

  mkdir /opt/oozie-3.0.2/examples/apps/hive_service/

Build your job.properties as usual.

nameNode=hdfs://rs01.hadoop.pvt:34310
jobTracker=rjt.hadoop.pvt:34311
queueName=default
oozie.libpath=/user/root/oozie/test/lib
oozie.wf.application.path=${nameNode}/user/root/oozie/test/main

Copy all hive jars into the lib folder.You also will have to build hive_test as well.

 cp /opt/hive-0.8.0/lib/*.jar /opt/oozie-3.0.2/examples/apps/hive_service/lib
cp m6d-oozie-$version.jar /opt/oozie-3.0.2/examples/apps/hive_service/lib
cp hive_test-$version.jar /opt/oozie-3.0.2/examples/apps/hive_service/lib

I also created this helper script to copy the project to hdfs sh pushToHdfs.sh

hadoop dfs -rm '/user/root/oozie/test/lib/*'
hadoop dfs -copyFromLocal /opt/hive-0.7.0/lib/* /user/root/oozie/test/lib
hadoop dfs -copyFromLocal lib/* /user/root/oozie/test/lib
hadoop dfs -rm /user/root/oozie/test/main/workflow.xml
hadoop dfs -copyFromLocal workflow.xml /user/root/oozie/test/main

Now we have everything in place except a job to execute.

    <java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.m6d.oozie.HiveServiceBAction</main-class>
<arg>rhiveservice.hadoop.pvt</arg>
<arg>10000</arg>
<arg>CREATE TABLE IF NOT EXISTS zz_zz_abc (a int, b int)</arg>
</java>

Argument 0 is the hive host, argument 1 is the hive port and the rest of the arguments are scripts to be sent to hive thrift's execute() method.

Screen shot incoming:

 

 

Friday Apr 27, 2012

Myth Busters: Ops edition. Is EC2 is less expensive then running your own gear? The Revenge

My retort to:

My response to Edward Capriolo’s “Myth Busters: Ops edition. Is EC2 is less expensive then running your own gear?”

Things that you won’t have on your servers for $175k.

  • Atomic-multi data center sub second volume snapshots. EBS volumes rock. Snapshots persistent to S3 are amazing.

Right. I do not need this! Hadoop barely supports appends. Sub millisecond snaphots for hadoop are way overkill. As I mentioned storing my 300TB of data is going to cost $6,000,000 a year. Are you suggesting I take some snapshots to EC2 and increase most costs to $12,000,000 or something?

  • Global redundancy.
This is a can of worms and a difficult problem to approach. Are you talking about redundant data redundant processing or both? Our software stack is like 95% installed and configured by puppet and we have an active/active presence in multiple data centers. This particular hadoop cluster in my comparison does not need to be redundant. Anecdotally our data centers have had better uptime then amazon. If I needed to come up with another DR plan I would 1) find a data center that is connected to my current 1 with "dark fiber". 2) stand up a second cluster over there would it be cheaper then the $6,000,000 a year S3 costs? Almost certainly.
  • Elasticity

EC2 is not the only way to be elastic, although that is what amazon would like people to believe.
Q: You bought 20 servers. How long did that take?
A: Our network is now sufficiently large that we are always doing standing orders. We have spare capacity. We are ELASTIC! Because all our our machines run linux, all of our software is Java, all of our installs are done by puppet. Hadoop cluster "A" is slow and cluster "B" is fast reconfigure in puppet. Done deal. True I can't grow from 100 to 200 servers in a minute. But if I have a 100 servers and 10 tomcat webservers are only 50% capacity, I can install something else on them. Isn't that what a cloud is in a nutshell?

Statement: "Don't care alot when your job runs."
Answer: I do. When you have a report called RTB Lunch, people expect it by lunch. Seriously most of our processes are time dependent.

Honestly, I feel that when people pitch elasticity to me they are just selling me a pipe-dream. "Like dude, my site facepaper.co could blow up at any moment and when it does I'm going to need to auto scale my mongrel app to 900,000 servers" On a more serious note, this hadoop cluster used for batch analytic work. The only way more load can hit it is if we get significantly more data or add many more jobs, ultimate elasticity is not really needed here.

"Now, let’s talk about where you can save 10X. Things AWS excels in that you did not even mention."

1) S3. I know your local SATA drives or SAN are cheaper. But are they designed for 11 9?s of redundancy? Compare that cost. Are they secure and globally accessible? Do they have virtually unlimited bandwidth to your alternate site/customers? Can you just keep growing them and only paying fer allocated space?

Has amazon every produced 11 9s of uptime? Aren't they down like 6 hours a year? on average.

Q: Are they secure and globally accessible?
A: Yes they are secure, and I do not need number 2.

Q: Do they have virtually unlimited bandwidth to your alternate site/customers?
A: Again, I don't need this.

Q: Can you just keep growing them and only paying fer allocated space?
A: I can keep adding disks to my hadoop datanodes and keep adding data nodes. I have already established the storage cost is less. (For reference when you buy the nodes buy a chassis with 8 bays but only fill 6. Hot swap in a drive and restart hadoop.)

2. bandwidth.

Well again this is a hadoop cluster I do not need external bandwidth. Generally not convinced EC2 bandwidth is cheap.

3. DynamoDB, SimpleDB-

You may have convinced me to do another blog because I am pretty sure I could use Cassandra and save you tons of money of DynamoDB. But again this is a hadoop cluster. This is out of scope.

4. Support staff.

I'm the "monkey". Hadoop is not hard it's easy. Disk dies replace it. No outage. Server dies. Call vendor. Ship replacement to DC. No outage. I am not going to touch this because it goes into a noops rant. Dealing with physical hardware is a small% of our ops time.  When we are not being monkeys us ops guys server other useful purposes. I'm just not going any further with this.

5 Opportunity costs, paradigms, and other big ideas.

So back in the day they had mainframes, then we moved to, 1u and 2u style servers, then blade centers, now amazon has us back to "time sharing" mainframes. Believe me I am ready for the future. The cloud is not the future. It's just a rehash of virtual hosted web servers with a newer dashboard.

6 Development flexibility

Just like your production servers your dev servers are going to cost more on amazon too.
Again I use puppet so trying to convince me that "AWS magic clones setups" is not happening.

You or one of your engineers wants to ‘try out’ something new. How much is that server?
Again have a spare pool. It's elastic. Why don't the devs just take some money out of their pay check and start up a cloud server themselves? I'm not stopping them. They must be worried about something ... Hint $$$. Just saying. I test stuff on my laptop.

7 growth. yell about “Everything being down” right at the moment you made it big.

Fiction! and Anecdotal. You trying to say amazon never goes down when you made it big?

http://money.cnn.com/2011/04/21/technology/amazon_server_outage/index.htm

I seem to remember my network being up this day :) Seriously though not to joke at others misfortune but less complicated things tend to fail less. Amazon has many failures from self imposed complexity you do not need.

Wrap up:

Amazon is great and does cool stuff. But the point of my original post was to outline a simple use case the cost of a 20 node hadoop cluster and show it's costs. The retort article pointed out that nothing is apples to apples. But I would almost call Jon's argument the "heated seats" argument.

I came here to buy a car, your trying to sell me a car with heated seats. I came here to store my data in hadoop, your telling me about point in time snapshots. I need fast hardware to run my hive jobs, your telling me about "bidding on spot instances".

Just like heated seats I am being told to "just pay a little more a month" on something I do not need. None of the things amazon is offering will make my hadoop cluster any better (arguably) and nothing will make it cheaper factually.

Monday Apr 23, 2012

Myth Busters: Ops edition. Is XEN performance negligible ?

In my last blog I established that it would cost 6 million dollars a year to store our 300TB data in the cloud.  Well the bad news just keeps coming for Amazon in this head to head battle. My ace in the whole is my long time rant over Xen vs container style virtualisation. Let's take a quick look at some Xen claims around the interwebs:

http://old-list-archives.xen.org/archives/html/xen-users/2007-08/msg00093.html

"depends on what you are planning to do. If you use paravirtualisation (for 
linux guests), then the overhead is quite low. Maybe 1-5% (max).

If you use hvm (hardware virtualisation), then there is definitely a bigger
performance loss."

http://www.linuxjournal.com/article/8540

"The overhead of running Xen is very small indeed, about 3%."

 www.cca08.org/papers/Poster10-Simone-Leo.pdf

"preliminary tests run with Xen[25] we
have indications that using virtual machines introduces a very
small overhead (less than 5%). "

Well we have some off-the-cuff estimates that Xen does not have much overhead. Unfortunately 5% of 100,000 is $5,000 and 5% of 5000 is $250. At minimum you have to accept the notion that your paying 5% for nothing. You can call it overhead, your can call it FooBar but if you would have spent that $ on a real server would have 5% more foobar to deploy your application on.

When you look at anything even semi-scientific you can see that these 1-3% claims are all puffery. Last time I checked 5% was not "very small". If 5% was missing from your pay check would you notice?

ppadala.net/research/dyncontrol/im07.pdf

http://wiki.openvz.org/Performance

A good question someone asked me was.

<nphase> ecapriolo: so why doenst everyone arbitrarily use openvz?

Great question! Let's examine this. For the record I do not use OpenVZ. I use http://linux-vserver.org/Welcome_to_Linux-VServer.org . Why? 

  1. Size of Image: Do you remember the old days when you could readily identify every single file in your *nix operating system? Now since most distros have to support USB wireless LAN, and flash thus the distros have gotten larger. But when you are dealing with servers you do not need about 95% of that. !!!!!!!!!!!!GET READY FOR THIS!!!!!!!!  Our minimal vserver template is a whopping 100MB! When I need to make a vserver for tomcat sure I need java and tomcat but this usually never gets larger then 200MB. This is a total game changer. I have a laptop at home with about 60 different images. It seems like the average AMI I find is 1GB-8GB. A great example of how the cloud is filled with bloat and excess. Who cares about a little more transfer here or there? You not paying for this right ?
  2. Performance - Containers are lean and mean. A container is little more then a jail. Think about full virtualization or paravirtualization, emulating kernels and interrupts, emulating network card, emulating physical devices. *Nix is multi user. How does emulating everything, including that bloated file system mentioned in point #1 gets you anything? Containers makes it easy to leverage mechanisms to control CPU, memory,  number of files. Most of this work has been mainlined into Linux Containers aka /cgroups now. The cost of the container is little more then the weight of a process. I regularly have 10 running on my laptop at any given time. The idle cost is the same as any idle process. I do not need 2GB dedicated heap to idle efficiently.
  3. Mainline. Not only is XEN not that efficient it is essentially abandoned and again. I forget how the story goes but microsoft bought citrix who created xen and then RedHad lost it's Xen boner and switched to KVM. 

So nphase, "Why doenst everyone arbitrarily use openvz?".

Well I guess you have the case where you need to run Windows and FreeBSD on the same server........

Is Xen performance negligible? Busted! 5% or more is not negligible.

So if your struggling on the AWS calculator trying to finagle numbers to make my cost equal to yours do not forget you need at least 5% more servers to make up for lost overhead. %5 of 20 servers is 1 entire server BTW.


Friday Apr 20, 2012

Myth Busters: Ops edition. Is EC2 is less expensive then running your own gear?

A buddy of mine and I were trying to figure out how much better ground computing is then cloud computing. We determined our typical hadoop machine is about equal to an extra large instance. We have a cluster of 20 of these machines. 2x quad socked processors, 32GB RAM, 8x 2TB disks

Here are some numbers:
http://calculator.s3.amazonaws.com/calc5.html?key=calc-50102DB3-33D8-4E3F-A6B8-1D5F7585D1C5

20 x extra large instances 12 hours of usage per day on one year term.

Its like you have to pay about 80K upfront and about 5K a month. So in a year its going to be about 140K -150K (without any storage)

Now If i was to run these servers in my data-center :

20 servers = about 150K (upfront cost)
Rack and power = about 2K a month
Total expenses in year one about 175K about 25-35K more from AWS - looks bad, huh?

but.... With our own hardware

Free storage about 400TB (Raw)
Cluster available 24 hours
After first year cost is only 2K a month (24k/year) and all the hardware is free.

I went ahead for 3 year comparison:

http://calculator.s3.amazonaws.com/calc5.html?key=calc-5E8BCB11-697A-4432-A13A-450BE3D7F213

$126K upfront for AWS and about 5K a month = 186K for first year
120K for next two year
total needs to be paid to AWS = 126+120 = 246K

If i were to put this in my own datacenter i'd pay:
150K upfront and 2K a month = about 175K for first year
48K for next two year
total expenses = 223K
+
free storage about 400TB (Raw)
cluster available 24 hours a day
I still have those 3 year old machine maybe someone will buy them for 10K? :)

BTW - I'm not too sure how to calculate $$$ for storage in AWS yet, but when I do it its come to about 585K a month - assuming that you want to keep all the data in AWS...

Myth Busted? Absolutely busted to heck.


Wednesday Apr 18, 2012

The Real Slim Shady - C* parody

"The real no-sql"

May I have your data please?
May I have your data please?
Will the best NoSQL db please start up?
I repeat will the best nosql solution please start up?
We're gonna have a split-brain problem here...

Yall act like you never seen Big Data before
Jaws all on the the floor like Hadoop The elephant burst in the door
Been servin request before your nosql took its first core
in YCSB some discourse, our bar charts towering over you score(AHH!)
Its the return of the.. "Ah wait, no way, your replicating"
He didn't drop what I think he did, did he?
And Dr Dobbs said... Nothing you idioms,
Dr Dobbs read, "Cassandra is your MySQL replacement"  (ha ha!)
Sensationalist tech reporters love Cassandra
[*vocal turntable: chigga chigga chigga*]
"Cassandra, im sick of it
Look at it, walking around denormalizing you-know-what,
replicating to ec2, "Yeah, but is soo cool though!"
Yea, It's probably got a couple bugs up in the patch queue,
But no worse, then whats goin on in your agile scrum room
Sometimes, I wanna get cnet and let fud loose, but can't
but its cool for someone to question CAP proof's
"My data on your disk, my data on your disk"
And it I'm lucky, you might read with some thrift.
And thats the message we deliver to college kids
And expect them not to know what eventual consistency is
Or course they gonna know what read-repair is,
by the time they hit the deployment stage,
they hang on the IRC dont they?
"We aint nothing but mammals.." Well, some of us cannibals
who rip solandra up like cantaloupes
Then theres no reason Java and C code can't cope
[*ewww!*] But if your 95th is hurt, I got the antidote
Others send your twitter firehose, install the Cassandra package and it goes


Cause I'm NoSQL , yes I'm the real NoSQL,
all your other noSQLs are just imitating,
So won't the real NoSQL please start up,
Please start up, please start up.

Thursday Apr 05, 2012

Exploring IronCount 2.0 through an example

Like all good 1.0 releases IronCount was destined to be made 'good'er by a 2.0 release.

The package layout was re-factored. This was cosmetic but it makes hacking at the project easier.

The MessageHandler interface was given a new method stop(). Though the message handler interface will remain as stable as possible, stop was needed. The framework attempts to execute stop() whenever a workload is paused or removed. This is useful if a process is batching up writes that are not yet committed to their destination. Stop() can allow you to clean up.

package com.jointhegrid.ironcount.manager;
import kafka.message.Message;
public interface MessageHandler extends java.io.Serializable {
  public void setWorkload(Workload w);
  public void handleMessage(Message m) ;
  public void stop();
  public void setWorkerThread(WorkerThread wt) ; //just to call commit
}

A new feature was added to Workload as well. 

 public List<URL> classloaderUrls;

Workloads could be added and removed dynamically, however the Java classpath was loaded at start time and could not change. This means that it might have taken a shut down and restart to update a handler or add new ones. With this feature the list of URLs is used by the WorkerThread to construct a URLClassLoader. This construct gives us the ability to Hot Deploy handlers without restarting the entire application.

The other option for doing this considered was the method hadoop uses. Essentially building Fat Jars and shipping them along with your job. I found that problematic over the years.  I felt the URLClassLoading system gives a maven-esc feel. Theoretically if (when :) IronCount catches on big time, users can leave handlers on public web servers, and other users can use them with the URLClassloader. Also across a network managing jars on a set of web servers is trivial, or you could list jars in your existing maven repository.

The next feature that is a big game changer is JMX. I considered adding a web interface to manage everything, but I am tired of web interfaces. When deploying a Workload users used to have to start an instance of WorkloadManager to leverage the API and talk to Zookeeper. IronCount 2.0 exposes the applyWorkload(String) method through JMX. This makes it easy to start and stop workloads.

UUID of nodes running the job.

We also have statistics of messages processes and time spent processing.

I had a requirement to do some real time calculations of city and state from our event logs. IronCount to the rescue! This is a perfect use case for Cassandra counter feature. Read a line and update counters. We apply some batching because there is no reason to send one increment per source row.

package com.example.rt;

import com.jointhegrid.ironcount.manager.MessageHandler;
import com.jointhegrid.ironcount.manager.WorkerThread;
import com.jointhegrid.ironcount.manager.Workload;
import com.jointhegrid.ironcount.mockingbird.CounterBatcher;
import java.nio.ByteBuffer;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Map;


import kafka.message.Message;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CityStateHandler implements MessageHandler {

  Workload w;
  WorkerThread wt;
  Cluster cluster;
  Keyspace keyspace;
  DateFormat bucketByMinute;
  CounterBatcher<String, String> cb;


  public CityStateHandler() {
    bucketByMinute= new SimpleDateFormat("yyyy-MM-dd-HH-mm");
    cb = new CounterBatcher<String, String>();
  }

  @Override
  public void setWorkload(Workload w) {
    this.w = w;
    cluster = HFactory.getOrCreateCluster("rts", w.properties.get("rts.cas"));
    keyspace = HFactory.createKeyspace(w.properties.get("rts.ks"), cluster);
  }

  @Override
  public void handleMessage(Message m) {
    doCityStateCount(m);
  }

  public void doCityStateCount(Message m){
    try {
      String[] parts = getMessage(m).split("\t");
      long time_in_ms = Long.parseLong(parts[1]);
      String city = parts[9];
      String state = parts[10];
      GregorianCalendar gc = new GregorianCalendar();
      gc.setTimeInMillis(time_in_ms);
      cb.increment(bucketByMinute.format(gc.getTime()), state + "," + city, 1);
      if (cb.incCounter % 1000 == 0) {
        persist();
        cb.clear();
      }
    } catch (Exception ex) {
      System.out.println(ex);
    }
  }

  @Override
  public void setWorkerThread(WorkerThread wt) {
    this.wt = wt;
  }

  @Override
  public void stop() {
    persist();
    cb.clear();
  }

  public static String getMessage(Message message) {
    ByteBuffer buffer = message.payload();
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    return new String(bytes);
  }

  public void persist() {
    for (String key : cb.incMap.keySet()) {
      Map<String, Long> cols = cb.incMap.get(key);
      for (Map.Entry<String, Long> e : cols.entrySet()) {
        Mutator<String> mut = HFactory.createMutator(keyspace, StringSerializer.get());
        HCounterColumn<String> hc = HFactory.createCounterColumn(e.getKey(), e.getValue());
        mut.addCounter(key, w.properties.get("rts.cf"), hc);
        mut.execute();
      }
    }
  }
}

Lets look at some results.

[default@rts] get rts['2012-04-05-17-00'];
...
=> (counter=WI,Stoughton, value=2)
=> (counter=WI,SunPrairie, value=2)
=> (counter=WI,Superior, value=1)
=> (counter=WI,Thiensville, value=2)
=> (counter=WI,Waukesha, value=8)
=> (counter=WI,Waunakee, value=1)
=> (counter=WI,Waupaca, value=1)
=> (counter=WI,WestBend, value=2)
=> (counter=WI,WisconsinRapids, value=1)
=> (counter=WV,Beckley, value=5)
=> (counter=WV,Bethany, value=1)

Sweet! IronCount 2.0 strikes again! Super easy to build aggregations on real time data. Now, it is easier to manage and deploy with JMX and URL class loading.

An alternative to writing handlers from scratch is using a Domain Specific Language along the lines of HStreaming, acunu-analytics, and Countandra. There is feature mismatch and overlap between IronCount and these three projects, but IronCount focuses on being the lightest framework possible.

Sunday Apr 01, 2012

Redundant failover capable hive thrift

A while back Nathan covered a blog on production-izing-hive-thrift. Well as you know the #1 and #2 considerations in my life are scalability and redundancy.

We were doing too much (as in at all) bash + "hive -e" and I was happy to get to the future with hive-service. There is one new consideration however, running one instance hive-service is not enough. All the jobs will launch from a single node which could cause utilization to spike. Second, if that node hangs up or goes down there is going to be a world of hurt.

Step 1: Run two hive thrift severs. This is easy, just follow Nathan's instructions two or more times. The result:


(http://www.askmen.com)

Next we can call our networking guys to setup a TCP based load balancer or setup a proxy. In this case because hive thrift is not very network intensive I went with a simple HA-proxy configuration.

/etc/haproxy/rhiveserver.cfg
listen hive-primary 10.71.74.218:10000
balance leastconn
mode tcp
server hivethrift1 rs06.hadoop.pvt:10000 check
server hivethrift2 rs07.hadoop.pvt:10000 check

Now here is the kicker. What if the node with HA-proxy fails? Isn't that a single point of failure? Yes it is a single point of failure, but since this service has no real state we can fail it over to a new node fairly quickly with linux-ha. Linux-ha allows us to manage resources across clusters of servers.

Many people who deploy linux-ha still deploy it in the deprecated V1 mode where all the resources must live on one side of the other on a two node clusters. If you are one of those people it is time to man-up and switch to V2. V2 allows resources to run independently across N node clusters.

The IPAddr2 resource agent is built into linux-ha. The HAProxy OCF can be downloaded separately at  https://github.com/russki/cluster-agents.

In linux-ha groups move together and start up in order. That means in this case the VIP and haproxy instance move together and start in a specific order. (They also stop in the reverse order).

$crm 
crm# config
crm(live)configure# show
primitive haproxy_10000 ocf:heartbeat:HAProxy \
        params config="/etc/haproxy/rhiveserver.cfg" pid="/var/run/haproxy_rhive.pid" proxy="/usr/sbin/haproxy"
primitive ip_ha_hiveserver ocf:heartbeat:IPaddr2 \
        params ip="10.71.74.218" cidr_netmask="16" nic="eth1"
group ha_hiveserver_grp ip_ha_hiveserver haproxy_10000 \
meta target-role="Started"

Then we can use the crm_mon tool to interactively watch where the resource comes up.

[root@rs02 ~]# crm_mon -1
============
Last updated: Sun Apr  1 11:36:05 2012
Stack: Heartbeat
Current DC: rs02.hadoop.pvt (3fa9313f-0e32-4277-be23-b71fdd8af14d) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
7 Resources configured.
============

Online: [ rs03.hadoop.pvt rs02.hadoop.pvt ]

 Resource Group: ha_hiveserver_grp
     ip_ha_hiveserver    (ocf::heartbeat:IPaddr2):    Started rs03.hadoop.pvt
     haproxy_10000    (ocf::heartbeat:HAProxy):    Started rs03.hadoop.pvt

 

Users can reach hive-service on the redundant failover capable vip on 10.71.74.218.

Mission Accomplished!


Monday Mar 19, 2012

More Taco Bell Programming with Solandra


[Read More]

Saturday Mar 17, 2012

The NoDev revolution has started

Developers primary function is wheel reinvention. Take for
example the Solandra https://github.com/tjake/Solandra application which allows the Solr application to be sharded across multiple Cassandra nodes. This application was re-implemented in Hbase using https://github.com/akkumar/hbasene, only to be reimplemented in Hbase again a few
months later in http://www.slideshare.net/KyungseogOh/solbase.  

When developers get tired of directly writing code to solve problems they are paid to solve they move onto generalizations and frameworks. This allows them to write code that does not solve a specific problem, but solves a set of problems that your company does not have. An example of this would be Java based web development. Writing Servlets to access databases quickly becomes repetitive. Changing the servlet specification and creating multiple servlet engines like Jetty, Tomcat, and Glassfish was not enough. The next step for developers is to create large frameworks like Enterprise Java Beans. After years of development, and large sums of money, these things are usually invalidated and people move onto other frameworks like Hibernate, JPA, or the Spring Framework.

Knowing multiple complex systems that accomplish next to nothing allows developers to become lead developers and negotiate much higher salaries. Once developers become advanced enough they typically know about 20 frameworks. They are capable of writing code that grants them a high level of job security because finding another developer that knows those same 20 frameworks is almost impossible. Then they then enter the "other language" phase of their career.

The other language phase is the realization that since none of the 20 frameworks they know are not sufficient to do their job and they must learn some other programming language. The second language is typically very esoteric, useless and non standard like Go, Clojure, or nodejs. They less people in the world that have heard of it the better!

This offers a chance for developers to go back to the wheel reinventing phase under the guise of the new language. Soon the developers will be back to writing frameworks, with funky names like phibernate (This is perl port of Hibernate (http://hibernate.org).) Or they may try uber-integrations combinations like Clojure/spring/hibernate http://www.coderanch.com/t/539167/clojure/Incubator/Clojure-Spring-Hibernate. Functional languages are particularly hot at the moment because they allow programers to revisit languages like Lisp and ML that have not been wheel reinvented since the 60s.

Developers also typically engage in internal reorganization efforts so they can develop useless frameworks faster. A few of these are extreme programming, waterfall, agile, and recently scrum. Combined with no serious formal certification processes other then cheating through college, devs use these systems as a way to divert attention away from the fact that they are reinventing code and instead focus on how quickly and efficiently they are reinventing it.

Now that we understand the role of the developer in the traditional enterprise we can start to understand the NoDev alternative.

Nodev stresses not coding something unless you really-really have to. Our first action is not to fork a codebase on github. For example, if there is a business need for full text search, use puppet to quickly install already existing solutions like Solandra into a development environment. Also use your bash skills to deliver a small relevant proof of concept or quickly write the entire solution without any frameworks. Actions like these save your company thousands of man hours and eliminate the need for a good portion of development.

NoDev focuses on the end user. The end user is not the developer, the end user is the customer or person using the application. End users care about results. End users would rather see a nice report generated by crystal reports rather then hearing about how a developer spent 3 hours designing that same report in functional scala. End users don't care that you spent 6 hours learning and setting up scalading when you can get the same results with a three line hive program. NoDev embraces the fact that sometimes you can save money by spending money, and that solving problems does not always require more devs and more code.

Suppose your company wants to add a new feature to their product. In the ass backwards DevOps fashion Developers have liberal god like access to your
MySQL database. They will likely use some code generation utility that will create the table in the most inefficient way possible creating columns as tinytext rather then varchar(25). They will then use continuous development to put
this poorly designed system into production. Which will likely effect other applications running on the database and cause the DBA to have to work late or worry about load alarms. Meanwhile ops staff will be busily adding more read slaves to handle load as the developer is at home brogramming their next startup idea.

The key focus of NoDevs is carrying a big stick and using it often. First
create users will the minimal privileges needed to accomplish a problem. This may mean for a logging application the user only has access to INSERT to SPECIFIC columns of SPECIFIC tables. Second, insist that developers come to you with the schema design and estimated size of the table, number of reads and writes day for approval.  Developers tend to hate this because according to whatever the flavour of the week coding system (lets say scrum) they now have tight deadlines. Because of there "tight deadlines" they can not afford to have "Operations people" "slowing them down" by telling them that there schema sucks and will need to be re-designed. When they complained about "how they can not complete their waterfall" simply tell management that "We need more DBAs" not more devs, because that is the truth! We do not compromise best practices and give access rights away because we are overworked. Once you start letting developers do schema design there ego becomes super inflated and then they start ideas like NoDev. Because while your are working thanklessly and endlessly behind the scenes cleaning their mess they come to the conclusion that you are not needed any more.

"Other language" problems also have no place in NoDev companies. NoDev enforces a company wide policy of 2 languages only. A scripting language and a compiled language. This eliminates the needs for developers to spend time researching serialization frameworks like AVRO, MessagePack, ProtoBufs, and Thrift to communicate simple strings between languages like OCamel and BashOnBalls. Again, running your enterprise in a NoDev fashion has saved your company thousands of man hours.

NoDev does not hate the cloud. Operations always looks to make things better and more efficient. But a NoDev organization does not allow people that know nothing about operations to tell them how the "cloud" is better. If Operations decides to move to the "cloud" operations do not change. As mentioned noDev will administer our databases in the same strict way. NoDev will still not allow
devs to have root in production in the cloud. NoDev will still not allow users that do not know anything about networking to have firewall control in the cloud. Sudo will still work the same way in the cloud. Normal capacity planning done by ops and management will still decide how resources are allocated in the cloud.  

Now that you understand some basic concepts of NoDev and some examples, we can keep balance in the enterprise. We can stand our ground as operations, DBA, and administrators, switching and routing engineers because for everyone NoOps use case out there we can come up with an equivalent NoDev use case as well.

NoOps prospective:


Dev want to DevOps, fire ops, and move to the cloud with elastic map reduce.

NoDev prospective:

Ops wants to noDev, fire dev, move to Datameer and crystal reports and let business analysts do dev work.