I’ll show you how to achieve the task which I mentioned in the title by
hive. rank function is important to do it. If you don’t use rank, you have to do the following steps to solve the task.
- Get the list of categories.
- Issue SELECT queries to get top n items for each category(SELECT … FROM … ORDER BY … LIMIT n)
- Arrange output if it’s needed
When you use
rank function, you need to issue SELECT query only 1. Query image is like following.
SELECT
*
FROM (
SELECT
*,
rank() over (PARTITION BY category_id ORDER BY quantity DESC) as rank
FROM
sample_table
) a
WHERE
rank < n;
After this, I’ll show you a sample.
Environment
Software versions which I used in this time are following.
Prepare test data
I’ll prepare test data like following. Delimiter is tab and each columns represent are item id, genre id, item name and number of purchase.
1 a book1 100
2 b fashion1 10
3 a book2 1000
4 c food1 1
5 c food2 10
6 b fashion2 100
7 d medicine1 100
8 b fashion3 1
9 b fashion4 10
10 b fashion5 1
11 c food3 100
12 a book3 10
13 a book4 1
I’ll insert test data to new table.
$ hive
use test;
CREATE TABLE test_item_order_data(
item_id bigint,
genre_id string,
item_name string,
quantity bigint
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INPATH "/path/to/test data/test.dat" OVERWRITE INTO TABLE test_item_order_data;
SELECT * FROM test_item_order_data;
1 a book1 100
2 b fashion1 10
3 a book2 1000
4 c food1 1
5 c food2 10
6 b fashion2 100
7 d medicine1 100
8 b fashion3 1
9 b fashion4 10
10 b fashion5 1
11 c food3 100
12 a book3 10
13 a book4 1
Run query with rank
Query with rank function should be like following. In this time, the query output top 3 items for each genres. I’ll use Tez in the code, but Tez is not must.
set hive.execution.engine=tez;
SELECT
*,
rank() over (PARTITION BY genre_id ORDER BY quantity DESC) as rank
FROM
test_item_order_data;
3 a book2 1000 1
1 a book1 100 2
12 a book3 10 3
13 a book4 1 4
6 b fashion2 100 1
9 b fashion4 10 2
2 b fashion1 10 2
8 b fashion3 1 4
10 b fashion5 1 4
11 c food3 100 1
5 c food2 10 2
4 c food1 1 3
7 d medicine1 100 1
SELECT
item_id,
genre_id,
item_name,
quantity
FROM (
SELECT
*,
rank() over (PARTITION BY genre_id ORDER BY quantity DESC) as rank
FROM
test_item_order_data
) io
WHERE io.rank < 3;
3 a book2 1000
1 a book1 100
6 b fashion2 100
9 b fashion4 10
2 b fashion1 10
11 c food3 100
5 c food2 10
7 d medicine1 100
As you can see the result, when there are 2 2nd items, 3rd item should be rank 4.