Edward Capriolo

Saturday Jul 19, 2014

Travis CI is awesome!

https://travis-ci.org/edwardcapriolo/teknek-core

Travis CI is awesome... That is all.

Thursday Jul 03, 2014

MapReduce on Cassandra's sstable2json backups

I was talking to a buddy about having nothing to do today. He said to me, "You know what would be awesome? We have all these sstable2json files in s3 and it would be cool if we could map reduce them."

For those not familiar, sstable2json makes files like this:

[
{"key": "62736d697468","columns": [["6c6173746e616d65","736d697468",1404396845806000]]},
{"key": "6563617072696f6c6f","columns": [["66697273746e616d65","656477617264",1404396708566000], ["6c6173746e616d65","63617072696f6c6f",1404396801537000]]}
]

Now, there already exists a JSON hive serde, https://github.com/rcongiu/Hive-JSON-Serde; however, there is a small problem.

That serde expects data to look like this:

{}
{}

Not like this:

[
{},
{}
]

What is a player to do? Make a custom input format, that is what.

The magic is in a little custom record reader that skips everything except what the json serde wants and trims trailing commas.

  @Override
  public synchronized boolean next(LongWritable arg0, Text line) throws IOException {
    boolean res = super.next(arg0, line);
    // Skip the opening '[' and closing ']' lines of the JSON array wrapper.
    if (line.charAt(0) == '['){
      res = super.next(arg0, line);
    }
    if (line.charAt(0) == ']'){
      res = super.next(arg0, line);
    }
    // Trim the trailing comma so each record is a standalone JSON object.
    if (line.getLength() > 0 && line.getBytes()[line.getLength() - 1] == ','){
      line.set(line.getBytes(), 0, line.getLength() - 1);
    }
    return res;
  }
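
For completeness, the input format itself is mostly boilerplate around that reader. Here is a rough sketch of what it could look like; the class name matches the one in the table DDL below, but the inner-class layout is my assumption, not the actual teknek source:

package io.teknek.arizona.ssjsoninputformat;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class SSTable2JsonInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // Hand back the reader that hides the JSON-array wrapper produced by sstable2json.
    return new SSTable2JsonRecordReader(job, (FileSplit) split);
  }

  /** Line reader that skips the '[' and ']' lines and trims trailing commas. */
  public static class SSTable2JsonRecordReader extends LineRecordReader {
    public SSTable2JsonRecordReader(JobConf job, FileSplit split) throws IOException {
      super(job, split);
    }
    // the next(LongWritable, Text) override shown above goes here
  }
}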

Next, we create a table using the JSON serde and the input format from above.

hive> show create table json_test1;                                                         
OK
CREATE  TABLE json_test1(
  key string COMMENT 'from deserializer',
  columns array<array<string>> COMMENT 'from deserializer')

ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'io.teknek.arizona.ssjsoninputformat.SSTable2JsonInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'file:/user/hive/warehouse/json_test1'
TBLPROPERTIES (
  'numPartitions'='0',
  'numFiles'='1',
  'transient_lastDdlTime'='1404408280',
  'numRows'='0',
  'totalSize'='249',
  'rawDataSize'='0')

When we use these together we get:

hive> SELECT key , col FROM json_test1 LATERAL VIEW explode (columns) colTable as col;
62736d697468    ["6c6173746e616d65","736d697468","1404396845806000"]
6563617072696f6c6f    ["66697273746e616d65","656477617264","1404396708566000"]
6563617072696f6c6f    ["6c6173746e616d65","63617072696f6c6f","1404396801537000"]
Time taken: 4.704 seconds, Fetched: 3 row(s)

Winning! Now there are some things to point out here:

  1. sstable2json with replication N is going to get N duplicates that you will have to filter yourself. (maybe it would be nice to build a feature in sstable2json that only dumps the primary range of each node?)
  2. You're probably going to need a group by and a window function to remove all but the last entry (dealing with overwrites and tombstones), something like the sketch below.
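
Here is a rough sketch of that last-write-wins cleanup (assuming Hive 0.11+ window functions, and using the fact that the third element of each column array is the write timestamp):

-- keep only the newest write per (row key, column name), using the timestamp in col[2]
SELECT key, col
FROM (
  SELECT key, col,
         row_number() OVER (PARTITION BY key, col[0]
                            ORDER BY cast(col[2] AS BIGINT) DESC) AS rn
  FROM (
    SELECT key, col
    FROM json_test1 LATERAL VIEW explode(columns) colTable AS col
  ) exploded
) ranked
WHERE rn = 1;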

But whatever, I just started playing with this this morning. I do not have time to sort out all the details. (maybe you don't have updates and this is not a big deal for you).

Tuesday Jul 01, 2014

Next hadoop enterprise pissing match beginning

http://hortonworks.com/blog/cloudera-just-shoot-impala/

"I hold out hope that their interests in enabling Hive on Spark are genuine and not part of some broader aspirational marketing campaign laced with bombastic FUD."

I really think horton is being FUDLY here. Cloudera has had 1-2 people involved with the hive project for a while now, maybe like 6+ years. Carl is the hive lead; previously he worked for Cloudera. Cloudera has 2 people now adding features; one is Brock Noland, who is doing an awesome job.

Hortonworks is relatively new to the hive project. 2-3 years tops? (not counting people who did work for hive before joining horton)

So, even though cloudera did build impala (and made some noise about it being better than hive), they have kept steady support on the hive project for a very long time.

Spark is just very buzzy now. Everyone wants to have it, or be involved with it, like "cloud", but spark is actually 3-4 years old, right?

But it is really great to see spark. Everyone wants to have it, and the enterprise pissing matches are starting! Sit back and watch the fun! Low blows coming soon!

Previous pissing matches: 

  1. Who has the best hadoop distro?
  2. Who "leads" the community?
  3. Parquet vs ORC?
  4. Who got the "credit" for hadoop security and who did "all the work"?



Monday Jun 23, 2014

Server Side Cassandra

http://www.xaprb.com/blog/2014/06/08/time-series-database-requirements/

"Ideally, the language and database should support server-side processing of at least the following, and probably much more"

A co-worker found this. I love it. Sounds JUST like what I am trying to implement in:

https://issues.apache.org/jira/browse/CASSANDRA-6704

and what we did implement in https://github.com/zznate/intravert-ug .

How does that saying go? First they ignore you...

 


Tuesday Jun 17, 2014

Cloudera desktop manager forces you to disable SELINUX

This is a very curious thing. When trying to install cdh I found that it forces me to disable SELINUX completely. I can understand why an installer would have problems, but why won't it allow me to do the install in 'permissive' mode? Then I would be able to see the warnings.

This is kinda ##SHADEY##. Normally I do not give a shit about selinux, but being forced to have it completely disabled?

Monday Jun 09, 2014

Wait...Say what.

http://www.datastax.com/dev/blog/4-simple-rules-when-using-the-datastax-drivers-for-cassandra

Cassandra’s storage engine is optimized to avoid storing unnecessary empty columns, but when using prepared statements those parameters that are not provided result in null values being passed to Cassandra (and thus tombstones being stored). Currently the only workaround for this scenario is to have a predefined set of prepared statement for the most common insert combinations and using normal statements for the more rare cases.

So what you're saying is... if I don't specify a column when I insert, I delete it?
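
For the curious, a minimal sketch of the workaround that post describes, using the DataStax Java driver 2.x (the keyspace and table names here are made up for illustration): prepare one statement per insert combination instead of binding null for the columns you did not set.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class PreparedInsertCombos {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("test_ks");

    // Binding null for lastname here would write a tombstone for that column.
    PreparedStatement full = session.prepare(
        "INSERT INTO users (id, firstname, lastname) VALUES (?, ?, ?)");

    // Workaround: a second prepared statement that simply omits the column.
    PreparedStatement firstnameOnly = session.prepare(
        "INSERT INTO users (id, firstname) VALUES (?, ?)");

    session.execute(full.bind("ecapriolo", "edward", "capriolo"));
    session.execute(firstnameOnly.bind("bsmith", "bob")); // no tombstone written

    cluster.close();
  }
}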

Saturday May 17, 2014

The important lesson of functional programming

I wanted to point something out: many times I hear people going on and on about functional programming, how java can't be good without function passing (functors), how lambda features are massively important, or ivory tower talk about how terrible the 'kingdom of nouns' is.

Let us look at Wikipedia's definition of functional programming.
----

In computer science, functional programming is a programming paradigm, a style of building the structure and elements of computer programs, that treats computation as the evaluation of mathematical functions and avoids state and mutable data. 

---

Though hipsters and 'kingdom of verbs' fanboys will go on and on about lambdas, anonymous inner functions, and programs that have so many callbacks you need an API to unroll the callbacks into something readable, the important part of functional programming (to me) is avoiding state and mutable data, and you can leverage that concept from any language that has a method (function)!

Removing state has big benefits. One is repeatability, which brings testability. I enjoy writing code that is easily testable without mocking or writing a large test harness.

Here is an example. I am currently working on a teknek feature to coordinate how many instances of a process run on a cluster of nodes. At first you may think this is not a functional problem, because it depends on the state of local threads, as well as cluster state that is stored in ZooKeeper. Let's look at an implementation:

---

  private boolean alreadyAtMaxWorkersPerNode(Plan plan){
    List<String> workerUuids = null;
    try {
      workerUuids = WorkerDao.findWorkersWorkingOnPlan(zk, plan);
    } catch (WorkerDaoException ex) {
      return true;
    }
    if (plan.getMaxWorkersPerNode() == 0){
      return false;
    }
    int numberOfWorkersRunningInDaemon = 0;
    List<Worker> workingOnPlan = workerThreads.get(plan);
    if (workingOnPlan == null){
      return false;
    }
    for (Worker worker: workingOnPlan){
      if (worker.getMyId().toString().equals(workerUuids)){
        numberOfWorkersRunningInDaemon++;
      }
    }
    if (numberOfWorkersRunningInDaemon >= plan.getMaxWorkersPerNode()){
      return true;
    } else {
      return false;
    }
  }

---

workerThreads is a member variable, the method uses a data access object, and it is called from 'deep' inside a stateful application.

There is a simple way to develop this feature and still have great test coverage: eliminate state! Functional programming! Write methods that are functional, methods that always return the same output for the same inputs.

Let's pull everything not functional out of the method and see what marvellous things this does for us!

---

  @VisibleForTesting
  boolean alreadyAtMaxWorkersPerNode(Plan plan, List<String> workerUuids, List<Worker> workingOnPlan){
    if (plan.getMaxWorkersPerNode() == 0){
      return false;
    }
    int numberOfWorkersRunningInDaemon = 0;
    if (workingOnPlan == null){
      return false;
    }
    for (Worker worker: workingOnPlan){
      if (worker.getMyId().toString().equals(workerUuids)){
        numberOfWorkersRunningInDaemon++;
      }
    }
    if (numberOfWorkersRunningInDaemon >= plan.getMaxWorkersPerNode()){
      return true;
    } else {
      return false;
    }
  }

---

Look Ma! No state! All the state is in the caller!

---

  private void considerStarting(String child){
    Plan plan = null;
    List<String> workerUuidsWorkingOnPlan = null;
    try {
      plan = WorkerDao.findPlanByName(zk, child);
      workerUuidsWorkingOnPlan = WorkerDao.findWorkersWorkingOnPlan(zk, plan);
    } catch (WorkerDaoException e) {
      logger.error(e);
      return;
    }
    if (alreadyAtMaxWorkersPerNode(plan, workerUuidsWorkingOnPlan, workerThreads.get(plan))){
      return;
    }

---

Why is removing state awesome? For one, it makes Test Driven Development easy. Hitting this condition with an integration test is possible, but it involves a lot of effort and hard-to-coordinate timing. Since we removed the state, look how straightforward the test is.

---

  @Test
  public void maxWorkerTest(){
    Plan aPlan = new Plan().withMaxWorkersPerNode(0).withMaxWorkers(2);
    Worker workingOn1 = new Worker(aPlan, null, null);
    Worker workingOn2 = new Worker(aPlan, null, null);
    List<String> workerIds = Arrays.asList(workingOn1.getMyId().toString(), workingOn2.getMyId().toString());
    List<Worker> localWorkers = Arrays.asList(workingOn1,workingOn2);
    Assert.assertFalse(td.alreadyAtMaxWorkersPerNode(aPlan, workerIds, localWorkers));
    aPlan.setMaxWorkersPerNode(2);
    Assert.assertTrue(td.alreadyAtMaxWorkersPerNode(aPlan, workerIds, localWorkers));
  }

---

Holy crap! The final assert failed! Remember "testing reveals the presence of bugs not the absence". Bugs should be easy to find and fix now that the logic is not buried deep. In fact, I easily stepped through this code and found the problem.

---

 if (worker.getMyId().toString().equals(workerUuids)){

---

In Java it is not a syntax error to call String.equals(List). It always returns false! DOH. Without good testing we might not have even found this bug. Win. Let's fix that.

---

    for (Worker worker : workingOnPlan){
      for (String uuid : workerUuids) {
        if (worker.getMyId().toString().equals(uuid)){
          numberOfWorkersRunningInDaemon++;
        }
      }
    }

---

Now, let's use our friend Cobertura to see what our coverage looks like. (If you're not familiar with coverage tools, get acquainted fast. Cobertura is awesome sauce! It runs your tests and counts how many times each branch and line of code was hit! This way you see where you need to do more testing!)

[edward@jackintosh teknek-core]$ mvn cobertura:cobertura


Pretty good! We can see many of our cases are covered, and we could write a few more tests to reach 100%, but that is just academic at this point. Anyway, tests are great. I think of tests as a tripwire against future changes, and as assurance that the project does what it advertises.

Anyway, the big takeaway is that functional programming is possible from a "non-functional" language. Functional programming makes it easy to build and test applications. And as always, anyone that does not write tests should be taken out back and beaten with a hose.

For those interested you can see the entire commit here.

Saturday Apr 26, 2014

Implementing AlmostSQL, dream of the 2010 era

2010 will be the generation defined by a bunch of tech geeks trying as hard as possible to re-invent SQL and falling short over and over again. NoSQL was novel because it took the approach of not attempting to implement all the things that are hard to implement in a distributed way. Even without full SQL (or any SQL) these systems are still very useful for a variety of tasks....

But just being useful for some tasks is not good enough. The twitter tech universe requires you to be good at EVERYTHING to get VC. So even if you are not good at it or can't do it, just pretend like you can. Mislead or just lie!

The latest example of this phenomenon is http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html .

You read it and you think, "Oh shit! This is awesome. SQL!" Finally the dream is realized: fast, distributed SQL. SALVATION! Retweet that!

But before you tweet your grandma about what nosql she should now use to count the eggs in her fridge, read the docs about this "SQL" support:
http://spark.apache.org/docs/0.9.1/scala-programming-guide.html

Note that Spark SQL currently uses a very basic SQL parser. Users that want a more complete dialect of SQL should look at the HiveQL support provided by HiveContext.

OMG. WHAT THE FUCK IS THIS POINTLESS OBSESSION WITH CALLING HALF ASSED SQL LANGUAGES SQL?

This has come up on the Hive mailing list before, in the conversation "Why don't we rename it from HiveQL to Hive SQL?"

I will tell you why...

Cause SQL IS A STANDARD! I'm tired of every piece of tech that wants a press release to go viral just throwing the word SQL in there when they DON'T ACTUALLY DO SQL. You can't put lipstick on a pig (or the pig language). Supporting 20% of the SQL features does NOT make you an SQL system! Neither does building a scala DSL!

To add insult to injury, why not go on about how much better your "SQL" implementation is:

http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html


"The Catalyst framework allows the developers behind Spark SQL to rapidly add new optimizations, enabling us to build a faster system more quickly. In one recent example, we found an inefficiency in Hive group-bys that took an experienced developer an entire weekend and over 250 lines of code to fix; we were then able to make the same fix in Catalyst in only a few lines of code."

By the way, I am not saying spark sql is not interesting, cool, or novel, but don't go out bragging about how "hard it is for hive to fix group bys" and how "easy it is with catalyst" when you're basically not even implementing a full query language, or even half of one.

It is not surprising that a less complete query language has fewer lines of code than a more complete one!

Not CQL, SQL

As you may know, my quixotic quest is to have non-fork support for server-side operations in Cassandra.

See https://issues.apache.org/jira/browse/CASSANDRA-5184 for details on how this is going nowhere.

My latest take is pretty cool (to me anyway). Cassandra is made for slicing: http://wiki.apache.org/cassandra/API10. Call it the 'Legacy API' if you want, but the ugly truth is all the CQL stuff is still built on top of slice. To be clear, EVEN when you select 'some columns' it still calls a massive getSlice() that returns all the columns.

Anyway, here is my latest brilliant idea. What if we make Cassandra and H2 have a baby? We let Cassandra do what it is good at: slicing. Then, with the results of the slice, we allow a 'sub select' of that slice using FULL SQL. In other words: if your first slice/query fits into memory, load it into H2, then do any SQL query on that data only!

Something like this:

     Connection conn = o.getMutationTool().runSliceAndLoad("stats", Arrays.asList("dateAsRowKey", "1970-12-31"), 
              Arrays.asList("vertical", "page", "toys"),
              new int [] { CompositeTool.NORMAL, CompositeTool.NORMAL, CompositeTool.NORMAL},
              Arrays.asList("vertical", "page", "toys"),
              new int [] { CompositeTool.NORMAL, CompositeTool.NORMAL, CompositeTool.INCLUSIVE_END}
             );
      ResultSet rs = conn.createStatement().executeQuery("SELECT sum(value) from data");
      if (rs.next()){
        System.out.println(rs.getLong(1));
      }
      rs.close();
      conn.close();

 

Did I just blow your mind? Wouldn't that be amazeballs? Well don't worry, Arizona will have it, but for now I give you the secret sauce!

Step 1: H2 in memory, do an Astyanax slice

    // Open a throw-away in-memory H2 database (no persistence, minimal logging).
    conn = DriverManager.getConnection("jdbc:h2:mem:;LOG=0;LOCK_MODE=0;UNDO_LOG=0;CACHE_SIZE=4096");
    // Slice the Cassandra row with Astyanax using composite start/end columns.
    result = keyspace
            .prepareQuery(stats)
            .getKey(CompositeTool.makeComposite(CompositeTool.stringArrayToByteArray(rowKey)))
            .withColumnRange(
                    CompositeTool.makeComposite(
                            CompositeTool.stringArrayToByteArray(startColumn), startRange),
                    CompositeTool.makeComposite(CompositeTool.stringArrayToByteArray(endColumn),
                            finishRange), false, 10000).execute().getResult();

Step 2: Create an in-memory H2 table for the data

    // 'unwrap' is assumed to hold the decomposed composite column: the first
    // half of the entries are the dimension names, the second half their values.
    StringBuilder sb = new StringBuilder();
    sb.append("CREATE TEMPORARY TABLE data (");
    for (int j = 0; j < unwrap.size() / 2; j++) {
      sb.append(new String(unwrap.get(j))).append(" VARCHAR(255) ,");
    }
    sb.append("value bigint ");
    sb.append(")");
    conn.createStatement().execute(sb.toString());

Step 3: Load the data from Cassandra into H2

     // Bind the second half of 'unwrap' (the column values) as the varchar columns,
     // then the numeric value; 'l' is assumed to be the long parsed from the column value.
     ps = conn.prepareStatement("insert into data VALUES (?,?,?)");
     for (int j = unwrap.size() / 2, k = 0; j < unwrap.size(); j++, k++) {
       ps.setString(k + 1, new String(unwrap.get(j)));
     }
     ps.setLong(3, l);
     ps.execute();


Step 4: Return the Connection to the user so they can run WHATEVER QUERY THEY WANT!

      ResultSet rs = conn.createStatement().executeQuery("SELECT sum(value) from data");
      if (rs.next()){
        System.out.println(rs.getLong(1));
      }
      rs.close();
      conn.close();

Step 5. Winning

Use slicing to create the "primary dimension", then query the heck out of it, any way your little heart desires.

Sunday Apr 13, 2014

21 hour work weeks

http://www.policymic.com/articles/87465/why-we-all-shared-the-story-about-france-s-alleged-ban-on-after-work-e-mails

Sorry, I cannot agree; the world is getting soft. I do not believe you should work yourself to death or into bad health or anything like that. But if you like what you do, it is not really work... I do not count how much "work" I do a week as a way of showing off or anything, but hey, when I'm not "working" I might still be writing code, or reading about something that will help me work better.

I don't remember anyone in my family even liking the idea of not working hard... Maybe it's an Italian worker thing :)


 

Wednesday Apr 09, 2014

Oracle refuses to let me download old JVM

I have a Java project that will not build with OpenJDK. My machine has JDK 1.7 and the target platform is Java 1.6. I say to myself, "Hey, no problem, I'll just download an older JDK."

So I go to oracle.com, which forced me to sign up for an Oracle account. After I sign up I click the link to download the JDK...

 

God fucking kill yourself...All you "open source" f*ckers...Kill yourselves.

Tuesday Apr 08, 2014

Today's moment of CQL zen

[edward@jackintosh Downloads]$ netstat -nl | grep 9160
tcp        0      0 127.0.0.1:9160          0.0.0.0:*               LISTEN     
[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cassandra-cli
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 2.0.4

The CLI is deprecated and will be removed in Cassandra 3.0.  Consider migrating to cqlsh.
CQL is fully backwards compatible with Thrift data; see http://www.datastax.com/dev/blog/thrift-to-cql3

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] exit;
[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cqlsh
Connection error: Could not connect to localhost:9160

[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cqlsh 127.0.0.1 9160
Connection error: Could not connect to 127.0.0.1:9160
[edward@jackintosh Downloads]$ /home/edward/.farsandra/apache-cassandra-2.0.4/bin/cqlsh localhost 9160
Connection error: Could not connect to localhost:9160


Sunday Apr 06, 2014

The problem of making things too easy....

http://www.quantschool.com/home/programming-2/comparing_inmemory_data_stores/cassandra
-------------------------------

Since assetClass was not added as an index, it was neccessary to add "allow filltering" when creating the CQL query

SELECT data from marketdata.markets where assetClass=? allow filtering

The caveat being, not all data is guaranteed to be returned (which made Cassandra the least favorite of the system I analyzed, not to mention the performance issues encountered when query across the full One million population set)

------------------------

I seem to recall a blog I wrote:

"I do not like how CQL uses the terms 'PRIMARY KEY' and 'TABLE'. These are terms from the 'System'. If you do not understand what a Column Family is you are blind to the truth, you will not be able to design an effective Cassandra data store."

Obviously this user was under the impression that if Cassandra is providing a query language, the query language would be efficient. That is not the case. "Allow filtering" should NEVER be used in production. It is a toy given to CLI users for 10-row datasets.

The reason I always prefer thrift is that it FORCES you to design effectively, it FORCES you to know what you're doing. It is why I don't like calling 'rows' 'partitions', or calling 'columns' 'cells', or calling column families 'tables'. Screw the people that can't RTFM, cql won't save them anyway :)
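
Even in CQL the fix is a data modeling fix, not a query language fix: put the thing you query by into the partition key instead of reaching for "allow filtering". A rough sketch (the table and column names are my guesses based on the quoted query, not the original poster's schema):

CREATE TABLE marketdata.markets_by_asset_class (
  assetClass text,
  marketId   text,
  data       blob,
  PRIMARY KEY (assetClass, marketId)
);

-- now the lookup is a plain partition read, no ALLOW FILTERING required
SELECT data FROM marketdata.markets_by_asset_class WHERE assetClass = ?;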

comparing in memory data stores....

Just putting it out there: Cassandra is not likely to be happy with less than a 2 GB heap. What is the point of doing a "benchmark" when you are not very clear on how to use the technology?

------------

http://www.quantschool.com/home/programming-2/comparing_inmemory_data_stores

The excel spreadsheet with the full data results is attached below.

Please note that in these results, Cassandra is shown in red because it repeatedly failed when executing the query against one million trades

I did various searches for this and the suggested remedies were confusing. One posting recommended an increase in JVM size. For some reason that was beyond me, increasing the JVM size from 1GB to even 1.25GB meant that the server would not start.

Another post suggested reducing JVM size to keep GC's low.

A whole hosts of other posts were on why people had moved away from Cassandra, then others on exhaustive JVM tuning parameters to get it going (life's too short for that!)

On the whole I would like to discount Cassandra as I had to reduce the population size to 500,000 in order for it to work. But having made the effort, I decided to include it but mark it out as red.

Please infer your own conclusions from this (I have!)

-------------------------------

Yes! Thank you for providing 5 meaningless bars! Argh, headache! Let me provide some results for you:

mongo:        XXXXX
cassandra:  XXXXXXXXX
fishsticks:   XXXXXXXXXX

As you can see, my exhaustive benchmark shows fishsticks has far more letters than mongo.

Wednesday Apr 02, 2014

Freedom is over

"Open" source projects deny my commits, "social" networks won't let me post things.

It is just time to call it quits everyone.
