Edward Capriolo

Monday Apr 30, 2012

hive: SELECT without FROM kinda

I know many of you DBAs do not even use the Windows calculator. After all, you can just call:

SELECT 1+2;

But in Hive you could not do that; you always had to do:

SELECT 1+2 FROM <sometable that exists> LIMIT 1;

What a PITA!

Well, worry not! Your troubles are over. After thousands of man-hours, the most brilliant guy who writes this blog has a solution!


Yes! An input format that returns 1 split and 1 row. Amazing! No matter how many files or rows are in your path, you are guaranteed to get 1 and only 1 row back. No more LIMIT clauses.
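I won't paste the whole input format here, but the essence of the trick can be sketched in plain Java. This is a conceptual sketch only, not the actual com.m6d.dualinputformat.DualInputFormat (which implements Hadoop's InputFormat/RecordReader interfaces): ignore the input paths, report one split, emit one empty row.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class DualSketch {

  // The "getSplits" half of the trick: ignore the input paths completely
  // and always report exactly one split.
  public static List<String> getSplits(List<String> inputFiles) {
    return Collections.singletonList("dual-split");
  }

  // The "record reader" half: for that one split, emit exactly one empty row.
  public static Iterator<String> getRecordReader(String split) {
    return Collections.singletonList("").iterator();
  }
}
```

However many files the table's path holds, every query over it sees exactly one row, so expressions like 1+2 evaluate once.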

client.execute("create table dual (fake string) " +
    "STORED AS INPUTFORMAT 'com.m6d.dualinputformat.DualInputFormat' " +
    "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'");
client.execute("load data local inpath '" + p.toString() + "' into table dual");
client.execute("select count(1) as cnt from dual");
String row = client.fetchOne();
assertEquals( "1", row);

Well that just says it all. Don't it?

Sunday Apr 29, 2012

Hive and Oozie Featuring "Hive Plan B" Action

The last couple of blog posts have been rant-like, and I really did not mean to get onto tracks like that. I am not trying to tell anyone how they should do anything; I just want to make a good case for #groundcomputing. Back to something technical people can use.

Oozie has had a HiveAction for a while now, but it has some caveats that make it less than attractive to me:

  1. You have to bundle your hive-site.xml with your Hive job
  2. It interacts with Hive by layering on top of the CLI driver
  3. Hive jobs run on random task trackers
    1. Hive needs to connect to your metastore to run
    2. Hive leaves junk around in /tmp/hive-$user, etc.

The largest deal breaker is actually item two. In terms of API, interacting with the Thrift service is cleaner than launching a CLI and trying to program over it. A few posts ago we covered the setup of a redundant Hive Thrift server.

Well, wouldn't it be great if an Oozie action would talk to the Hive service instead of layering on top of the Hive CLI? Enter the HiveService "Plan B" Action. Technically, it is not an Action; it is a program designed to work with the java-main-action, but that is just splitting hairs.

To use it:

Make a directory for your project:

  mkdir /opt/oozie-3.0.2/examples/apps/hive_service/

Build your job.properties as usual.


Copy all the Hive jars into the lib folder. You will also have to build hive_test.

cp /opt/hive-0.8.0/lib/*.jar /opt/oozie-3.0.2/examples/apps/hive_service/lib
cp m6d-oozie-$version.jar /opt/oozie-3.0.2/examples/apps/hive_service/lib
cp hive_test-$version.jar /opt/oozie-3.0.2/examples/apps/hive_service/lib

I also created this helper script, pushToHdfs.sh, to copy the project to HDFS:

hadoop dfs -rm '/user/root/oozie/test/lib/*'
hadoop dfs -copyFromLocal /opt/hive-0.7.0/lib/* /user/root/oozie/test/lib
hadoop dfs -copyFromLocal lib/* /user/root/oozie/test/lib
hadoop dfs -rm /user/root/oozie/test/main/workflow.xml
hadoop dfs -copyFromLocal workflow.xml /user/root/oozie/test/main

Now we have everything in place except a job to execute.

<arg>CREATE TABLE IF NOT EXISTS zz_zz_abc (a int, b int)</arg>

Argument 0 is the Hive host, argument 1 is the Hive port, and the rest of the arguments are scripts to be sent to Hive Thrift's execute() method.
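For reference, a java-main-action wrapping the Plan B program might look something like this in workflow.xml. The main class name and hostname here are my assumptions; substitute whatever your build produces:

```xml
<action name="hive-plan-b">
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- hypothetical main class; use the one from m6d-oozie -->
    <main-class>com.m6d.oozie.HiveServiceBAction</main-class>
    <!-- arg 0: hive thrift host, arg 1: port, rest: statements -->
    <arg>hivethrift.hadoop.pvt</arg>
    <arg>10000</arg>
    <arg>CREATE TABLE IF NOT EXISTS zz_zz_abc (a int, b int)</arg>
  </java>
  <ok to="end"/>
  <error to="fail"/>
</action>
```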

Screen shot incoming:



Friday Apr 27, 2012

Myth Busters: Ops edition. Is EC2 less expensive than running your own gear? The Revenge

My retort to:

My response to Edward Capriolo's "Myth Busters: Ops edition. Is EC2 less expensive than running your own gear?"

Things that you won’t have on your servers for $175k.

  • Atomic-multi data center sub second volume snapshots. EBS volumes rock. Snapshots persistent to S3 are amazing.

Right. I do not need this! Hadoop barely supports appends. Sub-second snapshots for Hadoop are way overkill. As I mentioned, storing my 300TB of data is going to cost $6,000,000 a year. Are you suggesting I take some snapshots to EC2 and increase my costs to $12,000,000 or something?

  • Global redundancy.
This is a can of worms and a difficult problem to approach. Are you talking about redundant data, redundant processing, or both? Our software stack is about 95% installed and configured by Puppet, and we have an active/active presence in multiple data centers. This particular Hadoop cluster in my comparison does not need to be redundant. Anecdotally, our data centers have had better uptime than Amazon. If I needed to come up with another DR plan I would 1) find a data center that is connected to my current one with "dark fiber", and 2) stand up a second cluster there. Would it be cheaper than the $6,000,000 a year S3 costs? Almost certainly.
  • Elasticity

EC2 is not the only way to be elastic, although that is what Amazon would like people to believe.
Q: You bought 20 servers. How long did that take?
A: Our network is now sufficiently large that we always have standing orders. We have spare capacity. We are ELASTIC! All of our machines run Linux, all of our software is Java, and all of our installs are done by Puppet. If Hadoop cluster "A" is slow and cluster "B" is fast, reconfigure in Puppet. Done deal. True, I can't grow from 100 to 200 servers in a minute. But if I have 100 servers and 10 Tomcat webservers are at only 50% capacity, I can install something else on them. Isn't that what a cloud is, in a nutshell?

Statement: "Don't care a lot when your job runs."
Answer: I do. When you have a report called RTB Lunch, people expect it by lunch. Seriously, most of our processes are time dependent.

Honestly, I feel that when people pitch elasticity to me they are just selling me a pipe dream. "Like dude, my site facepaper.co could blow up at any moment, and when it does I'm going to need to auto-scale my Mongrel app to 900,000 servers." On a more serious note, this Hadoop cluster is used for batch analytic work. The only way more load can hit it is if we get significantly more data or add many more jobs; ultimate elasticity is not really needed here.

"Now, let’s talk about where you can save 10X. Things AWS excels in that you did not even mention."

1) S3. I know your local SATA drives or SAN are cheaper. But are they designed for 11 9's of redundancy? Compare that cost. Are they secure and globally accessible? Do they have virtually unlimited bandwidth to your alternate site/customers? Can you just keep growing them and only pay for allocated space?

Has Amazon ever produced 11 9's of uptime? Aren't they down like 6 hours a year, on average?

Q: Are they secure and globally accessible?
A: Yes, they are secure, and I do not need the global accessibility.

Q: Do they have virtually unlimited bandwidth to your alternate site/customers?
A: Again, I don't need this.

Q: Can you just keep growing them and only pay for allocated space?
A: I can keep adding disks to my Hadoop datanodes and keep adding datanodes. I have already established that the storage cost is less. (For reference, when you buy the nodes, buy a chassis with 8 bays but only fill 6. Hot-swap in a drive and restart Hadoop.)

2. Bandwidth.

Well, again, this is a Hadoop cluster; I do not need external bandwidth. I am generally not convinced EC2 bandwidth is cheap.

3. DynamoDB, SimpleDB-

You may have convinced me to write another blog post, because I am pretty sure I could use Cassandra and save you tons of money over DynamoDB. But again, this is a Hadoop cluster. This is out of scope.

4. Support staff.

I'm the "monkey". Hadoop is not hard; it's easy. Disk dies? Replace it. No outage. Server dies? Call the vendor, ship a replacement to the DC. No outage. I am not going to touch this further because it turns into a NoOps rant. Dealing with physical hardware is a small % of our ops time. When we are not being monkeys, us ops guys serve other useful purposes. I'm just not going any further with this.

5. Opportunity costs, paradigms, and other big ideas.

So back in the day they had mainframes, then we moved to 1U and 2U style servers, then blade centers, and now Amazon has us back to "time sharing" mainframes. Believe me, I am ready for the future. The cloud is not the future. It's just a rehash of virtual hosted web servers with a newer dashboard.

6. Development flexibility

Just like your production servers, your dev servers are going to cost more on Amazon too.
Again, I use Puppet, so the "AWS magically clones setups" pitch is not going to work on me.

You or one of your engineers wants to ‘try out’ something new. How much is that server?
Again, we have a spare pool. It's elastic. Why don't the devs just take some money out of their paycheck and start up a cloud server themselves? I'm not stopping them. They must be worried about something... hint: $$$. Just saying. I test stuff on my laptop.

7. Growth. Yell about "everything being down" right at the moment you made it big.

Fiction! And anecdotal. Are you trying to say Amazon never goes down when you make it big?


I seem to remember my network being up that day :) Seriously though, not to joke at others' misfortune, but less complicated things tend to fail less. Amazon has many failures from self-imposed complexity you do not need.

Wrap up:

Amazon is great and does cool stuff. But the point of my original post was to take a simple use case, a 20-node Hadoop cluster, and show its costs. The retort article pointed out that nothing is apples to apples. But I would almost call Jon's argument the "heated seats" argument.

I came here to buy a car; you're trying to sell me a car with heated seats. I came here to store my data in Hadoop; you're telling me about point-in-time snapshots. I need fast hardware to run my Hive jobs; you're telling me about "bidding on spot instances".

Just like heated seats, I am being told to "just pay a little more a month" for something I do not need. None of the things Amazon is offering will make my Hadoop cluster any better (arguably), and factually nothing will make it cheaper.

Monday Apr 23, 2012

Myth Busters: Ops edition. Is Xen performance overhead negligible?

In my last blog post I established that it would cost 6 million dollars a year to store our 300TB of data in the cloud. Well, the bad news just keeps coming for Amazon in this head-to-head battle. My ace in the hole is my long-time rant over Xen vs. container-style virtualisation. Let's take a quick look at some Xen claims around the interwebs:


"depends on what you are planning to do. If you use paravirtualisation (for 
linux guests), then the overhead is quite low. Maybe 1-5% (max).

If you use hvm (hardware virtualisation), then there is definitely a bigger
performance loss."


"The overhead of running Xen is very small indeed, about 3%."


"preliminary tests run with Xen[25] we
have indications that using virtual machines introduces a very
small overhead (less than 5%). "

Well, we have some off-the-cuff estimates that Xen does not have much overhead. Unfortunately, 5% of $100,000 is $5,000, and 5% of $5,000 is $250. At minimum you have to accept the notion that you're paying 5% for nothing. You can call it overhead, you can call it FooBar, but if you had spent that money on real servers you would have 5% more capacity to deploy your application on.

When you look at anything even semi-scientific, you can see that these 1-3% claims are all puffery. Last time I checked, 5% was not "very small". If 5% was missing from your paycheck, would you notice?



A good question someone asked me was.

<nphase> ecapriolo: so why doenst everyone arbitrarily use openvz?

Great question! Let's examine this. For the record, I do not use OpenVZ. I use http://linux-vserver.org/Welcome_to_Linux-VServer.org. Why?

  1. Size of image: Do you remember the old days when you could readily identify every single file in your *nix operating system? Since most distros now have to support USB, wireless LAN, flash, and so on, they have gotten larger. But when you are dealing with servers you do not need about 95% of that. !!!!!!!!!!!!GET READY FOR THIS!!!!!!!! Our minimal vserver template is a whopping 100MB! When I need to make a vserver for Tomcat, sure, I need Java and Tomcat, but this usually never gets larger than 200MB. This is a total game changer. I have a laptop at home with about 60 different images. It seems like the average AMI I find is 1GB-8GB, a great example of how the cloud is filled with bloat and excess. Who cares about a little more transfer here or there? You're not paying for this, right?
  2. Performance - Containers are lean and mean. A container is little more than a jail. Think about full virtualization or paravirtualization: emulating kernels and interrupts, emulating the network card, emulating physical devices. *nix is already multi-user. How does emulating everything, including that bloated file system mentioned in point #1, get you anything? Containers make it easy to leverage mechanisms to control CPU, memory, and number of files. Most of this work has been mainlined into Linux Containers, aka cgroups, now. The cost of a container is little more than the weight of a process. I regularly have 10 running on my laptop at any given time. The idle cost is the same as any idle process. I do not need a 2GB dedicated heap to idle efficiently.
  3. Mainline. Not only is Xen not that efficient, it is essentially abandoned. I forget exactly how the story goes, but Citrix bought XenSource (the company behind Xen) and Red Hat lost its enthusiasm for Xen and switched to KVM.

So nphase, "Why doenst everyone arbitrarily use openvz?".

Well I guess you have the case where you need to run Windows and FreeBSD on the same server........

Is Xen performance overhead negligible? Busted! 5% or more is not negligible.

So if you're struggling with the AWS calculator trying to finagle numbers to make my cost equal to yours, do not forget you need at least 5% more servers to make up for the overhead. 5% of 20 servers is 1 entire server, BTW.
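For the skeptical, the arithmetic behind those figures is the simple percentage-of-spend version used throughout this post:

```java
public class XenTax {

  // Dollars lost to a given overhead fraction of a given spend.
  public static double overheadCost(double spend, double fraction) {
    return spend * fraction;
  }

  // Whole extra servers needed to cover the same overhead fraction of an
  // N-server cluster (the back-of-the-envelope version used above).
  public static long extraServers(int servers, double fraction) {
    return Math.round(servers * fraction);
  }
}
```

overheadCost(100000, 0.05) gives the $5,000 above, and extraServers(20, 0.05) gives that 1 entire server.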

Friday Apr 20, 2012

Myth Busters: Ops edition. Is EC2 less expensive than running your own gear?

A buddy of mine and I were trying to figure out how much better ground computing is than cloud computing. We determined our typical Hadoop machine is about equal to an extra-large instance. We have a cluster of 20 of these machines: 2x quad-core processors, 32GB RAM, 8x 2TB disks.

Here are some numbers:

20 x extra large instances 12 hours of usage per day on one year term.

You have to pay about 80K upfront and about 5K a month, so in a year it is going to be about 140K-150K (without any storage).

Now if I were to run these servers in my own datacenter:

20 servers = about 150K (upfront cost)
Rack and power = about 2K a month
Total expenses in year one: about 175K, about 25-35K more than AWS - looks bad, huh?

but.... With our own hardware

Free storage: about 400TB (raw)
Cluster available 24 hours a day
After the first year the cost is only 2K a month (24K/year) and the hardware is already paid for.

I went ahead and did a 3-year comparison:


$126K upfront for AWS and about 5K a month = 186K for the first year
120K for the next two years
Total paid to AWS = 186 + 120 = 306K

If I were to put this in my own datacenter I'd pay:
150K upfront and 2K a month = about 175K for the first year
48K for the next two years
Total expenses = about 223K
Free storage: about 400TB (raw)
Cluster available 24 hours a day
I still have those 3-year-old machines; maybe someone will buy them for 10K? :)

BTW - I'm not too sure how to calculate $$$ for storage in AWS yet, but when I do it, it comes to about 585K a month - assuming that you want to keep all the data in AWS...
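To make the tallies reproducible, here is the whole comparison as code, using the round figures above (upfront plus a flat monthly charge; storage, power increases, and resale value are ignored, so the "own gear" totals land a hair under the post's "about" numbers):

```java
public class ClusterCost {

  // Total cost in $K over a term: upfront plus a flat monthly charge.
  public static int totalK(int upfrontK, int monthlyK, int months) {
    return upfrontK + monthlyK * months;
  }

  public static void main(String[] args) {
    System.out.println("AWS year 1:  " + totalK(126, 5, 12) + "K");
    System.out.println("Own year 1:  " + totalK(150, 2, 12) + "K");
    System.out.println("AWS 3 years: " + totalK(126, 5, 36) + "K");
    System.out.println("Own 3 years: " + totalK(150, 2, 36) + "K");
  }
}
```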

Myth Busted? Absolutely busted to heck.

Wednesday Apr 18, 2012

The Real Slim Shady - C* parody

"The real no-sql"

May I have your data please?
May I have your data please?
Will the best NoSQL db please start up?
I repeat will the best nosql solution please start up?
We're gonna have a split-brain problem here...

Yall act like you never seen Big Data before
Jaws all on the floor like Hadoop the elephant burst in the door
Been servin request before your nosql took its first core
in YCSB some discourse, our bar charts towering over your score (AHH!)
Its the return of the.. "Ah wait, no way, your replicating"
He didn't drop what I think he did, did he?
And Dr Dobbs said... Nothing you idiots,
Dr Dobbs read, "Cassandra is your MySQL replacement"  (ha ha!)
Sensationalist tech reporters love Cassandra
[*vocal turntable: chigga chigga chigga*]
"Cassandra, im sick of it
Look at it, walking around denormalizing you-know-what,
replicating to ec2, "Yeah, but it's so cool though!"
Yea, It's probably got a couple bugs up in the patch queue,
But no worse, then whats goin on in your agile scrum room
Sometimes, I wanna get cnet and let fud loose, but can't
but its cool for someone to question CAP proof's
"My data on your disk, my data on your disk"
And if I'm lucky, you might read with some thrift.
And thats the message we deliver to college kids
And expect them not to know what eventual consistency is
Of course they gonna know what read-repair is,
by the time they hit the deployment stage,
they hang on the IRC dont they?
"We aint nothing but mammals.." Well, some of us cannibals
who rip solandra up like cantaloupes
Then theres no reason Java and C code can't cope
[*ewww!*] But if your 95th is hurt, I got the antidote
Others send your twitter firehose, install the Cassandra package and it goes

Cause I'm NoSQL , yes I'm the real NoSQL,
all your other noSQLs are just imitating,
So won't the real NoSQL please start up,
Please start up, please start up.

Thursday Apr 05, 2012

Exploring IronCount 2.0 through an example

Like all good 1.0 releases, IronCount was destined to be made 'good'er by a 2.0 release.

The package layout was re-factored. This was cosmetic but it makes hacking at the project easier.

The MessageHandler interface was given a new method, stop(). Though the interface will remain as stable as possible, stop() was needed: the framework attempts to execute stop() whenever a workload is paused or removed. This is useful if a handler is batching up writes that are not yet committed to their destination; stop() gives you a chance to clean up.

package com.jointhegrid.ironcount.manager;
import kafka.message.Message;
public interface MessageHandler extends java.io.Serializable {
  public void setWorkload(Workload w);
  public void handleMessage(Message m);
  public void stop();
  public void setWorkerThread(WorkerThread wt); // just to call commit
}

A new feature was added to Workload as well. 

 public List<URL> classloaderUrls;

Workloads could be added and removed dynamically; however, the Java classpath was loaded at start time and could not change. This meant it might have taken a shutdown and restart to update a handler or add new ones. With this feature, the list of URLs is used by the WorkerThread to construct a URLClassLoader. This gives us the ability to hot-deploy handlers without restarting the entire application.
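The mechanism itself is only a few lines. A sketch of what the WorkerThread does with those URLs (the jar path and handler class name would come from the Workload; nothing here is IronCount's exact code):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

public class HandlerLoader {

  // Build a class loader over the workload's jar URLs, delegating to the
  // application class loader for everything else (framework classes, JDK).
  public static ClassLoader loaderFor(List<URL> classloaderUrls) {
    return new URLClassLoader(classloaderUrls.toArray(new URL[0]),
        HandlerLoader.class.getClassLoader());
  }
}
```

The WorkerThread can then call loadClass() on the handler class name from the Workload and instantiate it reflectively; swapping a handler is just pushing a new jar and re-applying the workload.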

The other option considered for doing this was the method Hadoop uses: essentially building fat jars and shipping them along with your job. I have found that problematic over the years. I felt the URLClassLoader system gives a maven-esque feel. Theoretically, if (when :) IronCount catches on big time, users can leave handlers on public web servers, and other users can use them with the URLClassLoader. Also, managing jars on a set of web servers across a network is trivial, or you could list jars in your existing Maven repository.

The next feature, a big game changer, is JMX. I considered adding a web interface to manage everything, but I am tired of web interfaces. When deploying a Workload, users used to have to start an instance of WorkloadManager to leverage the API and talk to ZooKeeper. IronCount 2.0 exposes the applyWorkload(String) method through JMX. This makes it easy to start and stop workloads.
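A sketch of the moving parts. The MBean below is a stand-in with the same shape as the post describes, not the real WorkloadManager, and the ObjectName domain is my assumption; the point is the standard-MBean register-and-invoke pattern a JMX console uses:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxApply {

  // Stand-in management interface: standard MBean naming requires
  // the interface to be named <ClassName>MBean.
  public interface WorkloadManagerMBean {
    String applyWorkload(String workloadJson);
  }

  public static class WorkloadManager implements WorkloadManagerMBean {
    public String applyWorkload(String workloadJson) {
      return "applied:" + workloadJson;
    }
  }

  // Register the manager and invoke applyWorkload by name,
  // exactly as jconsole or a JMX client would.
  public static Object registerAndApply(String workloadJson) throws Exception {
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    ObjectName name = new ObjectName("com.jointhegrid.ironcount:type=WorkloadManager");
    if (!mbs.isRegistered(name)) {
      mbs.registerMBean(new WorkloadManager(), name);
    }
    return mbs.invoke(name, "applyWorkload",
        new Object[] { workloadJson },
        new String[] { String.class.getName() });
  }
}
```

Against a live WorkloadManager you would reach the same MBean remotely through a JMXConnector instead of the platform server.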

Through JMX we can also see the UUIDs of the nodes running a job, along with statistics on messages processed and time spent processing.

I had a requirement to do some real-time calculations of city and state from our event logs. IronCount to the rescue! This is a perfect use case for Cassandra's counter feature: read a line and update counters. We apply some batching because there is no reason to send one increment per source row.

package com.example.rt;

import com.jointhegrid.ironcount.manager.MessageHandler;
import com.jointhegrid.ironcount.manager.WorkerThread;
import com.jointhegrid.ironcount.manager.Workload;
import com.jointhegrid.ironcount.mockingbird.CounterBatcher;
import java.nio.ByteBuffer;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Map;

import kafka.message.Message;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HCounterColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CityStateHandler implements MessageHandler {

  Workload w;
  WorkerThread wt;
  Cluster cluster;
  Keyspace keyspace;
  DateFormat bucketByMinute;
  CounterBatcher<String, String> cb;

  public CityStateHandler() {
    bucketByMinute = new SimpleDateFormat("yyyy-MM-dd-HH-mm");
    cb = new CounterBatcher<String, String>();
  }

  public void setWorkload(Workload w) {
    this.w = w;
    cluster = HFactory.getOrCreateCluster("rts", w.properties.get("rts.cas"));
    keyspace = HFactory.createKeyspace(w.properties.get("rts.ks"), cluster);
  }

  public void handleMessage(Message m) {
    doCityStateCount(m);
  }

  public void doCityStateCount(Message m) {
    try {
      String[] parts = getMessage(m).split("\t");
      long time_in_ms = Long.parseLong(parts[1]);
      String city = parts[9];
      String state = parts[10];
      GregorianCalendar gc = new GregorianCalendar();
      // bucket counters by minute, keyed by "STATE,City"
      cb.increment(bucketByMinute.format(gc.getTime()), state + "," + city, 1);
      // flush the batch every 1000 increments
      if (cb.incCounter % 1000 == 0) {
        persist();
      }
    } catch (Exception ex) {
      // skip malformed rows
    }
  }

  public void setWorkerThread(WorkerThread wt) {
    this.wt = wt;
  }

  public void stop() {
    // flush any counters still batched up before shutdown
    persist();
  }

  public static String getMessage(Message message) {
    ByteBuffer buffer = message.payload();
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    return new String(bytes);
  }

  public void persist() {
    for (String key : cb.incMap.keySet()) {
      Map<String, Long> cols = cb.incMap.get(key);
      for (Map.Entry<String, Long> e : cols.entrySet()) {
        Mutator<String> mut = HFactory.createMutator(keyspace, StringSerializer.get());
        HCounterColumn<String> hc = HFactory.createCounterColumn(e.getKey(), e.getValue());
        mut.addCounter(key, w.properties.get("rts.cf"), hc);
        mut.execute();
      }
    }
    cb.incMap.clear();
  }
}

Let's look at some results.

[default@rts] get rts['2012-04-05-17-00'];
=> (counter=WI,Stoughton, value=2)
=> (counter=WI,SunPrairie, value=2)
=> (counter=WI,Superior, value=1)
=> (counter=WI,Thiensville, value=2)
=> (counter=WI,Waukesha, value=8)
=> (counter=WI,Waunakee, value=1)
=> (counter=WI,Waupaca, value=1)
=> (counter=WI,WestBend, value=2)
=> (counter=WI,WisconsinRapids, value=1)
=> (counter=WV,Beckley, value=5)
=> (counter=WV,Bethany, value=1)

Sweet! IronCount 2.0 strikes again! It is super easy to build aggregations on real-time data, and now it is easier to manage and deploy with JMX and URL class loading.

An alternative to writing handlers from scratch is using a Domain Specific Language along the lines of HStreaming, acunu-analytics, and Countandra. There is feature mismatch and overlap between IronCount and these three projects, but IronCount focuses on being the lightest framework possible.

Sunday Apr 01, 2012

Redundant failover capable hive thrift

A while back Nathan wrote a blog post on production-izing hive thrift. Well, as you know, the #1 and #2 considerations in my life are scalability and redundancy.

We were doing too much (as in, at all) bash + "hive -e", and I was happy to get to the future with hive-service. There is one new consideration, however: running one instance of hive-service is not enough. First, all the jobs will launch from a single node, which could cause utilization to spike. Second, if that node hangs up or goes down, there is going to be a world of hurt.

Step 1: Run two Hive Thrift servers. This is easy; just follow Nathan's instructions two or more times. The result:


Next we could call our networking guys to set up a TCP-based load balancer, or set up a proxy ourselves. In this case, because Hive Thrift is not very network intensive, I went with a simple HAProxy configuration.

listen hive-primary
  balance leastconn
  mode tcp
  server hivethrift1 rs06.hadoop.pvt:10000 check
  server hivethrift2 rs07.hadoop.pvt:10000 check

Now here is the kicker: what if the node with HAProxy fails? Isn't that a single point of failure? Yes, it is, but since this service has no real state we can fail it over to a new node fairly quickly with Linux-HA. Linux-HA allows us to manage resources across clusters of servers.

Many people who deploy Linux-HA still deploy it in the deprecated V1 mode, where all the resources must live on one side or the other of a two-node cluster. If you are one of those people, it is time to man up and switch to V2. V2 allows resources to run independently across N-node clusters.

The IPaddr2 resource agent is built into Linux-HA. The HAProxy OCF agent can be downloaded separately at https://github.com/russki/cluster-agents.

In Linux-HA, groups move together and start up in order. That means in this case the VIP and the HAProxy instance move together and start in a specific order. (They also stop in the reverse order.)

crm# config
crm(live)configure# show
primitive haproxy_10000 ocf:heartbeat:HAProxy \
        params config="/etc/haproxy/rhiveserver.cfg" pid="/var/run/haproxy_rhive.pid" proxy="/usr/sbin/haproxy"
primitive ip_ha_hiveserver ocf:heartbeat:IPaddr2 \
        params ip="" cidr_netmask="16" nic="eth1"
group ha_hiveserver_grp ip_ha_hiveserver haproxy_10000 \
meta target-role="Started"

Then we can use the crm_mon tool to interactively watch where the resource comes up.

[root@rs02 ~]# crm_mon -1
Last updated: Sun Apr  1 11:36:05 2012
Stack: Heartbeat
Current DC: rs02.hadoop.pvt (3fa9313f-0e32-4277-be23-b71fdd8af14d) - partition with quorum
Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
7 Resources configured.

Online: [ rs03.hadoop.pvt rs02.hadoop.pvt ]

 Resource Group: ha_hiveserver_grp
     ip_ha_hiveserver    (ocf::heartbeat:IPaddr2):    Started rs03.hadoop.pvt
     haproxy_10000    (ocf::heartbeat:HAProxy):    Started rs03.hadoop.pvt


Users can reach hive-service on the redundant failover capable vip on

Mission Accomplished!