When I extract data with a complex condition, my query is frequently have a deep nested sub query. Then columns which are not GROUP BY key are required at outside of a sub query.
To make an analysis query easier, there is a technique which I frequently use. I’ll take a memo of the technique to avoid forgetting.
I’ll write a query for ad system’s analysis query because I can imagine a nested query easily. There are 2 ad related data tables. “clicks” stores ad click data for each user. The other one is “purchases” which stores shop item purchase data for each user. Then I’ll extract how many times which type of ads contributes to user’s item purchase. Contributing means that letting a user to purchase item(last click before purchase).
Here is a sample query including the technique.
SELECT
SPLIT(last_c_info, ',')[6] AS ad_type,
COUNT(*)
FROM (
SELECT
first_p_info,
MAX(c_info) AS last_c_info
FROM (
SELECT
c_info,
MIN(p_info) AS first_p_info
FROM (
SELECT
CONCAT_WS(
',',
click_time,
click_id,
audience_id, -- User ID
publisher_id, -- Media Owner(ex. Blogger) ID
advertiser_id, -- Shop ID
ad_id,
ad_type
) c_info
FROM
clicks
WHERE
click_time >= '2019-12-01 00:00:00'
AND click_time < '2020-01-01 00:00:00'
) c
LEFT OUTER JOIN (
SELECT
CONCAT_WS(
',',
purchase_time,
purchase_id,
audience_id,
advertiser_id,
product_id -- Item ID
) p_info
FROM
purchases
WHERE
purchase_time >= '2019-12-01 00:00:00'
AND purchase_time < '2020-01-02 00:00:00'
) p
ON (c.audience_id = p.audience_id)
WHERE
p.purchase_time IS NOT NULL
AND p.purchase_time >= c.click_time
AND unix_timestamp(p.purchase_time) < unix_timestamp(c
.click_time) + 24 * 3600) -- max duration is 24 hours
GROUP BY
c_info
) first_purchases
) last_clicks
GROUP BY
SPLIT(last_c_info, ',')[6];
I’ll explain only main point because this article is not to share ad system. The technique is the following.
CONCAT_WS(
',',
click_time,
click_id,
audience_id,
publisher_id,
advertiser_id,
ad_id,
ad_type
) c_info
CONCAT_WS(
',',
purchase_time,
purchase_id,
audience_id,
advertiser_id,
product_id
) p_info
If you make a group of columns with CONCAT_WS, you can use columns not GROUP BY key easily with SPLIT. MIN/MAX works correctly if you put group by key to the first. Of course, this technique has some demerits.
- Data type becomes string after CONCAT_WS. So, you may need to do 0 padding for integer column.
- You need to care secondary group by key every time.
- You need to pick up a column with integer index from a grouped column. So, if you make a mistake with number, it’s a little difficult to do debug.
Another ideas except this technique are another JOIN and just bringing columns from deepest sub query. But, 1st one makes a long and complex query. If you need more columns from original table, you need to change many parts.
No comments yet.