Edward Capriolo

Wednesday Feb 17, 2016

Python users / Data Scientists measuring PITA levels

Before I get started trashing people, let me say I have the greatest respect for former and current colleagues, but there is a large, looming problem that needs to be addressed.

The fanboy level of Python usage in people, mainly data scientists, needs to stop.

A sick, blind devotion to Python, completely unchecked by reason

I was talking to a Python user about Spark:
Me: "What were you looking to use Spark for?"
Them: "I hear there is PySpark."
Me: "Yes, very interesting. What are you looking to use it for?"
Them: "PySpark."

ROFL: The only takeaway about the Spark platform is PySpark? Nothing else seemed interesting or caught your attention? Really, nothing about streaming or in-memory processing, just PySpark? lol #blinders

You would think [data] scientists want to learn things?

I encounter this debate mostly with Hive streaming. When someone asks me about Hive streaming, I look at the problem. Admittedly, there are a couple of tasks that are most easily addressed with streaming. But the majority of streaming things can be solved much more efficiently and correctly by writing a simple UDF or UDAF in Java. What is the common reply when a Hive committer, who wrote a book on Hive, explains unequivocally that a UDF is better for performance, debugging, and testability, and is not that hard to write?
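To give a sense of how small "a simple UDF" can be, here is a minimal sketch. The class name and the task (cleaning up a hostname) are made up for illustration. In a real Hive build the class would extend org.apache.hadoop.hive.ql.exec.UDF, the classic simple-UDF base class; I left that out so the sketch compiles with nothing but the JDK.

```java
import java.util.Locale;

// Hypothetical example: the class name and the hostname-cleaning task are
// assumptions for illustration only. A real Hive UDF of this style would
// extend org.apache.hadoop.hive.ql.exec.UDF; that dependency is omitted
// here so the sketch compiles with the JDK alone.
class NormalizeHostSketch {

    // The classic simple-UDF API looks up a method named evaluate() and
    // calls it per row, inside the same JVM as the rest of the query.
    public String evaluate(String host) {
        if (host == null) {
            return null; // pass NULL input through untouched
        }
        String cleaned = host.trim().toLowerCase(Locale.ROOT);
        if (cleaned.startsWith("www.")) {
            cleaned = cleaned.substring(4); // drop the leading "www."
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Because evaluate() is a plain method, it can be exercised
        // directly, without a cluster or a running query.
        NormalizeHostSketch udf = new NormalizeHostSketch();
        System.out.println(udf.evaluate("  WWW.Example.COM "));
    }
}
```

The testability point falls out of this shape: the logic is an ordinary method, so a unit test can call it with a handful of inputs, including NULL.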

"I don't want to learn how to compile things | learn about Java | learn about what you think is the right way to do things." You would think that a data scientist who is trying to search for great truths would actually want to find the best way to use a tool they have been working with for years.

Just to note: in Hive streaming, everything moves between processes via pipes, which is something like 4 context switches and two serializations for each row (not including the processing that has to happen in the pipe).
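The shape of that round trip can be modeled in a few lines. This is an illustration only: it runs inside one JVM with no real pipe or child process, the tab-delimited text format is an assumption, and all the names are invented. It just counts how many times a row has to be turned into bytes on each path.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative model only: no real pipe or external script is involved,
// and the tab-delimited row format is an assumption made for this sketch.
class StreamingRoundTripSketch {

    // How many times a row has been converted to bytes so far.
    static int serializations = 0;

    // Row -> bytes, as needed before writing to a pipe.
    static byte[] serialize(List<String> row) {
        serializations++;
        return String.join("\t", row).getBytes(StandardCharsets.UTF_8);
    }

    // Bytes -> row, as needed after reading from a pipe.
    static List<String> deserialize(byte[] bytes) {
        String line = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.asList(line.split("\t", -1));
    }

    // The actual per-row work: upper-case the first column.
    static List<String> process(List<String> row) {
        List<String> out = new ArrayList<>(row);
        out.set(0, out.get(0).toUpperCase(Locale.ROOT));
        return out;
    }

    // Streaming-style path: bytes on the way out to the script,
    // bytes again on the way back.
    static List<String> viaStreaming(List<String> row) {
        byte[] toScript = serialize(row);                      // query side writes
        List<String> seenByScript = deserialize(toScript);     // script parses input
        byte[] fromScript = serialize(process(seenByScript));  // script writes output
        return deserialize(fromScript);                        // query side parses reply
    }

    // UDF-style path: the same work as a direct in-process call.
    static List<String> viaUdf(List<String> row) {
        return process(row);
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("alice", "42");
        System.out.println(viaStreaming(row) + " serializations=" + serializations);
        System.out.println(viaUdf(row) + " serializations=" + serializations);
    }
}
```

Both paths return the same row. The difference is that the streaming-style path pays for two serializations and two parses per row, while the in-process call pays for none, and this model does not even include the context switches.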

I don't care that 100% of the environment is Java, I'm f*ckin special

A few years back someone (prototyping in Python) suggested we install LibHDFS. Later someone suggested we install WebHDFS. The only reason to install these things is that they must use Python to do things, even if there already are prior examples of doing this exact task in Java in our code base. Sysadmins should install new libraries, open new ports, monitor new services, and we should change our architecture, just because the Python user does not want to use Java for a task that 10 previous people have used Java for.

"I'm just prototyping"

This is the biggest hand wave. When scoping out a new project, don't bother looking for the best tool for the job. Just start hacking away at something, and then whatever type of monstrosity appears, just say it's already done; someone will just have you jam it into production anyway. Good luck supporting the "prototype" with no unit tests in production for the next 4 years. You would think that someone would take a lead from a professional coder and absorb their best practices. No, of course not, they will instead just tell you how best practices don't apply to them. #ThisISSparta!

Anyway, it's 7:00 am and I woke up to write this so that I can vent. But yeah, it's not Python, it's not data scientists, but there is just a hybrid intersection of the two that is so vexing.


