Edward Capriolo

Saturday May 05, 2012

Rank with hive

So I wanted to do a few things.

1) Promote a cool open source edition to hive by m6d (Rank UDF)
2) Promote the upcoming "Programming Hive" book (that I am co-authoring)

What better way then to give a preview of the m6d rank case study in the programming hive book?

--M6d UDF Pseudo Rank--

by David Ha and Rumit Patel

Sorting data and identifying the top n elements is straightforward. You order the whole data set by some criteria and limit the result set to n. But there are times when you need to group like elements together and find the top n elements within that group only. For example, identifying the top ten requested songs for each recording artist or the top 100 best selling items per product category and country. Several database platforms define a rank() function that can support these scenarios, but until Hive provides an implementation, we can create a user-defined function to produce the results we want. We will call this function p_rank() for psuedo-rank, leaving the name rank() for the Hive implementation.

Say we have the following product sales data and we want to see the top three items per category and country:

|===============================
|category       |country        |product        |sales
|movies |us     |chewblanca     |100
|movies |us     |war stars iv   |150
|movies |us     |war stars iii  |200
|movies |us     |star wreck     |300
|movies |gb     |titanus        |100
|movies |gb     |spiderella     |150
|movies |gb     |war stars iii  |200
|movies |gb     |war stars iv   |300
|================================

In most SQL systems :
[source,sql]
--------------------------
SELECT
 category,country,product,sales,rank
FROM (
 SELECT
   category,country,product, sales,
   rank() over (PARTITION BY category, country ORDER BY sales DESC) rank
 FROM p_rank_demo) t
WHERE rank <= 3
-----------------------------

To achieve the same result using HiveQL, the first step is partitioning the data into groups, which we can achieve using the distribute by clause. We must ensure that all rows with the same category and country are sent to the same reducer.

[source,sql]
--------------------------
DISTRIBUTE BY
 category,
 country
-------------------------

The next step is ordering the data in each group by descending sales using the sort by clause. While order by effects a total ordering across all data, sort by affects the ordering of data on a specific reducer. You must repeat the partition columns named in the distribute by clause.

[source,sql]
--------------------------
SORT BY
 category,
 country,
 sales DESC
--------------------------
Putting everything together, we have:

[source,sql]
--------------------------
ADD JAR p-rank-demo.jar;
CREATE TEMPORARY FUNCTION p_rank AS 'demo.PsuedoRank';

SELECT
 category,country,product,sales,rank
FROM (
 SELECT
   category,country,product,sales,
   p_rank(category, country) rank
 FROM (
   SELECT
     category,country,product,
     sales
   FROM p_rank_demo
   DISTRIBUTE BY
     category,country
   SORT BY
     category,country,sales desc) t1) t2
WHERE rank <= 3
----------------------------------

There is a little more to the case study but I have to save something for the book... The code is here:

https://github.com/edwardcapriolo/hive-rank

Enjoy!


Comments:

Post a Comment:
Comments are closed for this entry.

Calendar

Feeds

Search

Links

Navigation