I’ll show you how to achieve the task which I mentioned in the title by hive. rank function is important to do it. If you don’t use rank, you have to do the following steps to solve the task.
  1. Get the list of categories.
  2. Issue SELECT queries to get top n items for each category(SELECT … FROM … ORDER BY … LIMIT n)
  3. Arrange output if it’s needed
When you use rank function, you need to issue SELECT query only 1. Query image is like following.
SELECT
  *
FROM (
  SELECT
    *,
    rank() over (PARTITION BY category_id ORDER BY quantity DESC) as rank
  FROM
    sample_table
) a
WHERE
  rank < n;
After this, I’ll show you a sample.

Environment

Software versions which I used in this time are following.
  • Hadoop
    • 2.6.0.2.2.0.0-2041
  • hive
    • 0.14.0.2.2.0.0-2041
  • tez
    • 0.5.2.2.2.0.0-2041

Prepare test data

I’ll prepare test data like following. Delimiter is tab and each columns represent are item id, genre id, item name and number of purchase.
1       a       book1   100
2       b       fashion1        10
3       a       book2   1000
4       c       food1   1
5       c       food2   10
6       b       fashion2        100
7       d       medicine1       100
8       b       fashion3        1
9       b       fashion4        10
10      b       fashion5        1
11      c       food3   100
12      a       book3   10
13      a       book4   1
I’ll insert test data to new table.
$ hive
use test;

CREATE TABLE test_item_order_data(
  item_id   bigint,
  genre_id  string,
  item_name string,
  quantity  bigint
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH "/path/to/test data/test.dat" OVERWRITE INTO TABLE test_item_order_data;

SELECT * FROM test_item_order_data;
1       a       book1   100
2       b       fashion1        10
3       a       book2   1000
4       c       food1   1
5       c       food2   10
6       b       fashion2        100
7       d       medicine1       100
8       b       fashion3        1
9       b       fashion4        10
10      b       fashion5        1
11      c       food3   100
12      a       book3   10
13      a       book4   1

Run query with rank

Query with rank function should be like following. In this time, the query output top 3 items for each genres. I’ll use Tez in the code, but Tez is not must.
set hive.execution.engine=tez;
SELECT
  *,
  rank() over (PARTITION BY genre_id ORDER BY quantity DESC) as rank
FROM
  test_item_order_data;

3       a       book2   1000    1
1       a       book1   100     2
12      a       book3   10      3
13      a       book4   1       4
6       b       fashion2        100     1
9       b       fashion4        10      2
2       b       fashion1        10      2
8       b       fashion3        1       4
10      b       fashion5        1       4
11      c       food3   100     1
5       c       food2   10      2
4       c       food1   1       3
7       d       medicine1       100     1

SELECT
  item_id,
  genre_id,
  item_name,
  quantity
FROM (
  SELECT
    *,
    rank() over (PARTITION BY genre_id ORDER BY quantity DESC) as rank
  FROM
    test_item_order_data
) io
WHERE io.rank < 3;

3       a       book2   1000
1       a       book1   100
6       b       fashion2        100
9       b       fashion4        10
2       b       fashion1        10
11      c       food3   100
5       c       food2   10
7       d       medicine1       100
As you can see the result, when there are 2 2nd items, 3rd item should be rank 4.
zuqqhi2

Share
Published by
zuqqhi2
Tags: hive