Edward Capriolo

Sunday Apr 13, 2014

21 hour work weeks


Sorry, I cannot agree; the world is getting soft. I do not believe you should work yourself to death or into bad health or anything like that. But if you like what you do, it is not really work. I do not count how much "work" I do a week as a way of showing off or anything, but hey, when I'm not "working" I might still be writing code, or reading about something that will help me work better.

I don't remember anyone in my family even liking the idea of not working hard... Maybe it's an Italian worker thing :)


Wednesday Apr 09, 2014

Oracle refuses to let me download old JVM

I have a Java project that will not build with OpenJDK. My machine has JDK 1.7 and the target platform is Java 1.6. I say to myself, "Hey, no problem, I'll just download an older JDK."

So I go to oracle.com, which forces me to sign up for an Oracle account. After I sign up I click the link to download the JDK...


God fucking kill yourself...All you "open source" f*ckers...Kill yourselves.

Tuesday Apr 08, 2014

Today's moment of CQL zen

[edward@jackintosh Downloads]$ netstat -nl | grep 9160
tcp        0      0*               LISTEN     
[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cassandra-cli
Connected to: "Test Cluster" on
Welcome to Cassandra CLI version 2.0.4

The CLI is deprecated and will be removed in Cassandra 3.0.  Consider migrating to cqlsh.
CQL is fully backwards compatible with Thrift data; see http://www.datastax.com/dev/blog/thrift-to-cql3

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] exit;
[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cqlsh
Connection error: Could not connect to localhost:9160

[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cqlsh 9160
Connection error: Could not connect to
[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cqlsh localhost 9160
Connection error: Could not connect to localhost:9160

Sunday Apr 06, 2014

The problem of making things too easy....


Since assetClass was not added as an index, it was necessary to add "ALLOW FILTERING" when creating the CQL query

SELECT data from marketdata.markets where assetClass=? allow filtering

The caveat being that not all data is guaranteed to be returned (which made Cassandra the least favorite of the systems I analyzed, not to mention the performance issues encountered when querying across the full one-million-row population set)


I seem to recall a blog I wrote:

"I do not like how CQL uses the terms 'PRIMARY KEY' and 'TABLE'. These are terms from the 'System'. If you do not understand what a Column Family is you are blind to the truth, you will not be able to design an effective Cassandra data store."

Obviously this user was under the impression that if Cassandra provides a query language, the query language will be efficient. That is not the case. "ALLOW FILTERING" should NEVER be used in production. It is a toy given to CLI users for 10-row datasets.
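For completeness, the non-toy alternatives are to index the column or to model a table around the query. A sketch, assuming the marketdata.markets schema implied by the benchmark's query (the column name is taken from the query above; everything else here is illustrative):

```sql
-- A secondary index makes the predicate queryable without ALLOW FILTERING:
CREATE INDEX ON marketdata.markets (assetClass);

SELECT data FROM marketdata.markets WHERE assetClass = ?;
```

Secondary indexes carry their own performance caveats on large clusters; the idiomatic fix is to denormalize into a second table with assetClass in the partition key.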

The reason I always prefer Thrift is that it FORCES you to design effectively, it FORCES you to know what you're doing. It is why I don't like calling 'rows' 'partitions', or calling 'columns' 'cells', or calling column families 'tables'. Screw the people that can't RTFM, CQL won't save them anyway :)

comparing in memory data stores....

Just putting it out there: Cassandra is not likely to be happy with less than a 2 GB heap. What is the point of doing a "benchmark" when you are not very clear on how to use the technology?



The excel spreadsheet with the full data results is attached below.

Please note that in these results, Cassandra is shown in red because it repeatedly failed when executing the query against one million trades

I did various searches for this and the suggested remedies were confusing. One posting recommended an increase in JVM size. For some reason that was beyond me, increasing the JVM size from 1GB to even 1.25GB meant that the server would not start.

Another post suggested reducing JVM size to keep GC's low.

A whole hosts of other posts were on why people had moved away from Cassandra, then others on exhaustive JVM tuning parameters to get it going (life's too short for that!)

On the whole I would like to discount Cassandra as I had to reduce the population size to 500,000 in order for it to work. But having made the effort, I decided to include it but mark it out as red.

Please infer your own conclusions from this (I have!)


Yes! Thank you for providing 5 meaningless bars! Argh, headache! Let me provide some results for you:

mongo:        XXXXX
cassandra:  XXXXXXXXX
fishsticks:   XXXXXXXXXX

As you can see, my exhaustive benchmark shows fishsticks has far more letters than mongo.

Wednesday Apr 02, 2014

Freedom is over

"Open" source projects deny my commits, "social" networks won't let me post things.

It is just time to call it quits everyone.

Thursday Mar 27, 2014

What happen bawwss?


"(Now replaced with an internal database) Similar enough to cassandra "

Maybe not everyone is blowing their load over cql...after all.

Wednesday Mar 26, 2014

thrift is dead, long live the thrift

So on the Cassandra dev list there was recently a vote to end Thrift support. I am not really sure how the vote went, because the concept (a schema-less data store ONLY supporting a schema-heavy pseudo query language) is so entirely nonsensical that I actually un-joined the list.

Well anyway, I am glad that c* no longer wants to support thrift because I can now run a fork and finally add the things I want to thrift and not feel 'dirty' about the process.

To that end, I just demonstrated that the marvellous CQL query language does not support 'order by' on anything but the, I don't even know what they call it now. In the actual Cassandra database I used to know and love, we called it a 'column name', since a row is sorted by the 'column name'.

Let's think about this problem. Say we have a row filled with counters.

incr [myrow][bob] +5
incr [myrow][john] +9
incr [myrow][pete] +4
incr [myrow][sara] +2

This row is sorted by the column name: bob, john, pete, sara. But since your goal is to count things, what you probably want is to be able to ask for the highest or the lowest counts... like this...

[myrow] [john] = 9
[myrow][bob] = 5
[myrow][pete] = 4

Let's say you have 4000 columns in a row. The current Cassandra solution is to pull down all the columns and sort them yourself. But what if you only want the top 10 of these columns?
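The "pull down all the columns and sort them yourself" workaround looks something like this. A minimal sketch in plain Java (the names and the in-memory map are mine, standing in for a real client fetch of the row):

```java
import java.util.*;
import java.util.stream.*;

public class TopNCounters {
    // Client-side workaround: given every counter column in the row,
    // sort by value descending and keep only the top n.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> columns, int n) {
        return columns.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The example row from above
        Map<String, Long> row = new HashMap<>();
        row.put("bob", 5L);
        row.put("john", 9L);
        row.put("pete", 4L);
        row.put("sara", 2L);
        // top 3 by count: john=9, bob=5, pete=4
        System.out.println(topN(row, 3));
    }
}
```

The problem, of course, is that this client still has to transfer and sort all 4000 columns to produce 10.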

Now, with all this hubbub about how the Cassandra project needed to deprecate Thrift because it was so hard to manage and so obtuse, you would think the above problem statement is hard to solve...

You would not think someone like me could put together a working prototype in about 3 hours and with ~30 lines of code would you?



(its not totally done yet but you get the point)

Well guess what? I did. I would contribute this to cassandra but the new 'no thrift edict' refuses submissions like this feature. You might be waiting around 3-6 months for this feature until someone figures out how to cql-ify it... sucks to be you.

A great article explaining why I NEVER use regex for things I can do without regex


So I have actually gotten into debates with several people over why I avoid regex unless it is absolutely required. This pretty much sums it up.


"I'm acutely aware of that one because it burns people regularly. These aren't cases of hostile input, they're cases of innocently "erroneous" input. After maybe a year of experience, people using a backtracking regexp engine usually figure out how to write a regexp that doesn't go resource-crazy when parsing strings that *do* match. Those are the inputs the program expects. But all inputs can suffer errors, and a regexp that works well when the input matches can still go nuts trying to match a non-matching string, consuming an exponential amount of time trying an exponential number of futile backtracking possibilities."
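The classic pathological case is a nested quantifier like `(a+)+b` fed a long non-matching run of a's: the backtracking engine tries an exponential number of ways to split the run. My rule follows from that: for simple shape checks, plain string ops cannot blow up, since the worst case is one linear scan. A toy sketch (my own example, not from the article):

```java
public class NoRegexNeeded {
    // Checks "key=value" shape with plain string ops: no regex engine,
    // no backtracking, worst case a single pass over the string.
    static boolean looksLikeKeyValue(String s) {
        int eq = s.indexOf('=');
        return eq > 0 && eq < s.length() - 1; // non-empty key AND non-empty value
    }

    public static void main(String[] args) {
        System.out.println(looksLikeKeyValue("heap=2G"));   // true
        System.out.println(looksLikeKeyValue("=oops"));     // false
        System.out.println(looksLikeKeyValue("noequals"));  // false
    }
}
```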

Tuesday Mar 25, 2014

Stuff you expect CQL to do...

cqlsh:stats> CREATE TABLE userstats (     username varchar,     countername varchar,     value counter,     PRIMARY KEY (username, countername) ) WITH comment='User Stats' and compact storage;
cqlsh:stats> insert into userstats (username,countername, value) values ('ed','a',1);
Bad Request: INSERT statement are not allowed on counter tables, use UPDATE instead
cqlsh:stats> update userstats (username,countername, value) values ('ed','a',1);
Bad Request: line 1:17 mismatched input '(' expecting K_SET
cqlsh:stats> update userstats set username='ed',countername='a', value='1';
Bad Request: line 1:61 mismatched input ';' expecting K_WHERE
cqlsh:stats> update userstats set username='ed',countername='a', value='1' where username='ed';
Bad Request: PRIMARY KEY part username found in SET part
cqlsh:stats> update userstats set countername='a', value='1' where username='ed';
Bad Request: PRIMARY KEY part countername found in SET part
cqlsh:stats> update userstats set  value='1' where username='ed' and countername='a';
Bad Request: Invalid STRING constant (1) for value of type counter
cqlsh:stats> update userstats set  value=1 where username='ed' and countername='a';
Bad Request: Cannot set the value of counter column value (counters can only be incremented/decremented, not set)
cqlsh:stats> update userstats increment  value=1 where username='ed' and countername='a';
Bad Request: line 1:17 no viable alternative at input 'increment'
cqlsh:stats> update userstats add  value=1 where username='ed' and countername='a';
Bad Request: line 1:17 mismatched input 'add' expecting K_SET
cqlsh:stats> update userstats set  value=value+1 where username='ed' and countername='a';
cqlsh:stats> update userstats set  value=value+1 where username='ed' and countername='a';
cqlsh:stats> update userstats set  value=value+1 where username='ed' and countername='a';
cqlsh:stats> update userstats set  value=value+1 where username='ed' and countername='b';

Haha, got you, you f'er.

cqlsh:stats> select * from userstats order by value;
Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
cqlsh:stats> select * from userstats order by value where username='ed';
Bad Request: line 1:39 missing EOF at 'where'
cqlsh:stats> select * from userstats  where username='ed' order by value;
Bad Request: Order by is currently only supported on the clustered columns of the PRIMARY KEY, got value

Wamp wamp wamp...
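For reference, the only ORDER BY this table will ever accept is on the clustering column, with the partition key pinned. A sketch against the userstats table above:

```sql
-- countername is the clustering column, so this is the one legal ordering:
SELECT * FROM userstats WHERE username = 'ed' ORDER BY countername DESC;
-- Ordering by the counter value itself is simply not expressible in CQL.
```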

Saturday Mar 22, 2014

Security through obscurity and un-automation

Recently I was having a debate with someone over some software feature. The conversation revolved mostly around how groovy class loading was not as secure as building a jar and sending it to a server. This is true, using a URL classloader or something like the nit-compiler creates more avenues of "attack" than going through the process of building a jar, sending it to a server, and then using it.

However, the opportunity cost of not having it can be high. I recently added groovy support to Hive; pig has had similar support for a while.

All this talk about security reminds me of one thing: usually when someone talks about "security", be aware they are about to c*ck block you. Go back in time with me.

::wavey image like time travel scene in movie::

Young Ed was a system administrator at a company, let's call it jdsk.com to protect the not-so-innocent. Young Ed was forced to build the entire software stack from source on every box due to another false belief that "RPMs are slow and unoptimized". To accomplish this, Young Ed would install apache/php/mysql/kitchen sink from source using gcc on every single box that went into production. (These compiles take a LONG time.)

One day an old grey beard developer was on a machine (he did not need to be on) and made a comment. It was an old no longer even remotely realistic comment (that did not apply since mainframe days), "Having a compiler on a machine will allow hackers to more easily attack the machine once they have gained access to the machine". Grey beard developer was a supposed security expert because he supposedly once worked for "the government".

Anyway I explained to grey beard developer that his other more senior than me grey beard friends at the job insisted I not install software using RPM packages, and having a compiler on the machine was the only way I could get work done.

Grey beard developer was uncompromising, he suggested a new policy that I should un-install the gcc and developer tools when I was done doing my updates, and re-install these things when I needed them, for "security".

He believed in `Security through un-automation`: the belief that making things take longer makes them more secure. This is one of the largest fallacies ever. First let us look at reality. The first thing that makes you more secure is UPDATING YOUR SOFTWARE. When you do not use RPMs, when you waste time compiling software, and when you waste time UN-automating things, you have less time to keep things secure. You have less time to research actual security measures like setting up an Intrusion Detection System or locking down your sudo rules.

A few months later we had a database disk fill up. A piece of our software that wrote to the database was able to detect this and send an email. Who wrote this software? You will never guess. It was written by grey beard developer. I asked old grey beard developer, "Wow! it is great that our software was able to detect the other machine had a full disk! How did one machine determine the other machine had a full disk?"

What do you think this grey beard developer told me? That he built an RPC mechanism in the software to expose disk information?

Of course f*ckin not!!

This self declared "security expert"  mother f*in hard coded username/password into his program, and his program was making ssh connections to the other machine to check if the disk was full!

To be clear, I was not allowed to install gcc in order to keep our system "secure" but he was sticking passwords in binaries, so machines could log into each other!

Life lessons: "Hackers" have bots; they do not do things by hand. If they get access to your machine, they don't need gcc installed to continue attacking it. If you spend all day doing security through un-automation, you are giving everyone but yourself an easy job.

Other fun security through obscurity ideas:
"If your ssh server does not have a hostname it makes it harder for hackers to find it"

False. DNS servers could support a lookup like "give me all the hosts in a domain", but they do not allow it. "Hackers" cannot ask DNS for a list of hostnames; hackers simply use nessus to check for ssh servers on all IP addresses. By removing that hostname you made your own life harder.

"Long passwords are better for unix systems"

(half) False: Up until a few years ago most systems used crypt, which actually only took the first few characters of a password for the hash. Meaning if you had a password of 8334343434, only the first 8 characters were actually used to log in. Newer OSes and distros have better support for long passwords.

NAT is more secure than direct routing.

False: You do not know how many times I have heard someone say "Ow shit! host_with_private_information has been accessible from the internet for 2 years."


Before host_with_private_information owned that IP someone else did and that was a public site.

With or without NAT you can set up an off-by-default network. With or without NAT, if your ACL opens up a port, you may forget about it after you decommission and re-use the address. NAT was made to save IP address space. That is it.



Saturday Mar 08, 2014

Thrift isn't going anywhere. no more

Lately I have gotten the itch to add some features to Cassandra. One thing I noticed while trolling through the source code is that it is possible to perform multiple slices on a single row in one operation. This functionality was not directly accessible via thrift (not sure about the CQL state).

So thrift has a new method! get_multi_slice 

Imagine you have a row with columns named a-z

    private static void addTheAlphabetToRow(ByteBuffer key, ColumnParent parent)
            throws InvalidRequestException, UnavailableException, TimedOutException {
        for (char a = 'a'; a <= 'z'; a++) {
            Column c1 = new Column();
            c1.setName(ByteBuffer.wrap(String.valueOf(a).getBytes()));
            c1.setValue(new byte[0]);
            c1.setTimestamp(System.currentTimeMillis());
            server.insert(key, parent, c1, ConsistencyLevel.ONE);
        }
    }

The existing thrift method get_slice allows you to select columns in only one range: a-z, b-c, f-a. Suppose you want to select a-e and i-n in one operation but do not want the columns in the middle. The new method get_multi_slice to the rescue!

    public void test_multi_slice() throws TException {
        ColumnParent cp = new ColumnParent("Standard1");
        ByteBuffer key = ByteBuffer.wrap("multi_slice_two_slice".getBytes());
        addTheAlphabetToRow(key, cp);
        MultiSliceRequest req = makeMultiSliceRequest(key);
        req.setColumn_slices(Arrays.asList(columnSliceFrom("a", "e"), columnSliceFrom("i", "n")));

        assertColumnNameMatches(Arrays.asList("a", "b", "c", "d", "e", "i", "j", "k", "l", "m", "n"), server.get_multi_slice(req));
    }

In CQL-ish terms this gives you something equivalent to "select a-e, i-n from row where key=x".
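If the multi-slice semantics are still unclear, here is the behavior in plain Java terms: the union of several disjoint ranges taken from one sorted row, in one pass. A toy model of the semantics (my own sketch, not the thrift API):

```java
import java.util.*;

public class MultiSliceToy {
    // Models a multi-slice: the union of several inclusive [from, to]
    // column-name ranges over one sorted row, in order.
    static List<String> multiSlice(SortedMap<String, byte[]> row, String[][] slices) {
        List<String> names = new ArrayList<>();
        for (String[] s : slices) {
            // subMap's upper bound is exclusive, so append "\0" to keep 'to' inclusive
            names.addAll(row.subMap(s[0], s[1] + "\0").keySet());
        }
        return names;
    }

    public static void main(String[] args) {
        // A row with columns named a-z, like addTheAlphabetToRow builds
        SortedMap<String, byte[]> row = new TreeMap<>();
        for (char c = 'a'; c <= 'z'; c++)
            row.put(String.valueOf(c), new byte[0]);
        // The two slices from the test above: a-e and i-n
        System.out.println(multiSlice(row, new String[][] {{"a", "e"}, {"i", "n"}}));
    }
}
```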

The Thrift-Defence-League is not done yet! Coming soon: optional timestamps. Ow, and some other stuff too crazy to be let out yet :)

Thursday Mar 06, 2014

I may not change the world but I will touch the mind that changes the world (Tupac quote)


Got this on LinkedIn today. Awesome!


Inspired by your rant on state-of-the-YARN, your Hive book, and particularly your Big Data Travel Plan...

I was a chemical engineer doing "large datasets" in med school, and am now on the Big Data and Machine Learning bandwagon (without even knowing)! I too do my own plumbing!


Saturday Mar 01, 2014

Today I grouted a toilet and fixed a Cassandra bug!

Today I did this:

And this:


I feel very accomplished. I am a

Renaissance Man

Now, I go play x-box.

Thursday Feb 13, 2014

Introducing Now Its Time Compiler

Hey! You know everyone cannot wait for Java 8 to get closures, but what if I told you you could have that stuff now? Like if you want to do something stupid fresh like pass a Java array to a JavaScript method? Well no problem, Nit (Now is Time) makes it easy!

Say you want to... write a function with Groovy-like closure syntax from Java...

  public void constructAClosure() throws NitException {
    NitDesc n = new NitDesc();
    n.setScript("{ tuple -> println(tuple); return 1 }");
    Closure c = NitFactory.construct(n);
    Assert.assertEquals(1, c.call("dude"));
  }

Or write a method in clojure and call it from java

  public void constructAClojClosure() throws NitException {
    NitDesc n = new NitDesc();
    n.setScript("(ns user) (defn fil [a] (if (= a  \"4\" ) a ))");
    Var v = NitFactory.construct(n);
    Assert.assertEquals("4", v.invoke("4"));
  }

or do server side javascript?

  public void constructAJavaScript() throws NitException {
    NitDesc n = new NitDesc();
    n.setScript("function over21(row) { if (row > 21) return true; else return false; }");
    Function f = NitFactory.construct(n);
    Context context = Context.enter();
    Scriptable scope = context.initStandardObjects();
    Assert.assertEquals(true, f.call(context, scope, scope, new Object[]{ 22 }));
    Assert.assertEquals(false, f.call(context, scope, scope, new Object[]{ 20 }));
  }

Or just make reflection somewhat easier

  public void constructClassWithConArgs() throws NitException {
    NitDesc n = new NitDesc();
    n.setConstructorArguments(new Object[] { "http", "teknek.io", "/some/cool/stuff" });
    URL u = NitFactory.construct(n);
    Assert.assertEquals("teknek.io", u.getHost());
  }

You get the idea :)

What is next? JRuby, Jython, Java compiling inside Java? Who knows?