アルゴリズム

How to bring columns not GROUP BY key from nested subqueries

When I extract data with a complex condition, my query is frequently have a deep nested sub query. Then columns which are not GROUP BY key are required at outside of a sub query. To make an analysis query easier, there is a technique which I frequently use. I’ll take a memo of the technique to avoid forgetting. I’ll write a query for ad system’s analysis query because I can imagine a nested query easily. There are 2 ad related data tables. “clicks” stores ad click data for each user. The other one is “purchases” which stores shop item purchase data for each user. Then I’ll extract how many times which type of ads contributes to user’s item purchase. Contributing means that letting a user to purchase item(last click before purchase). Here is a sample query including the technique.
SELECT
  SPLIT(last_c_info, ',')[6] AS ad_type,
  COUNT(*)
FROM (
  SELECT
    first_p_info,
    MAX(c_info) AS last_c_info
  FROM (
    SELECT
      c_info,
      MIN(p_info) AS first_p_info
    FROM (
      SELECT
        CONCAT_WS(
          ',',
          click_time,
          click_id,
          audience_id,  -- User ID
          publisher_id,  -- Media Owner(ex. Blogger) ID
          advertiser_id,  -- Shop ID
          ad_id,
          ad_type
        ) c_info
      FROM
        clicks
      WHERE
        click_time >= '2019-12-01 00:00:00'
        AND click_time < '2020-01-01 00:00:00'
    ) c
    LEFT OUTER JOIN (
      SELECT
        CONCAT_WS(
          ',',
          purchase_time,
          purchase_id,
          audience_id,
          advertiser_id,
          product_id  -- Item ID
        ) p_info
      FROM
        purchases
      WHERE
        purchase_time >= '2019-12-01 00:00:00'
        AND purchase_time < '2020-01-02 00:00:00'
    ) p
    ON (c.audience_id = p.audience_id)
    WHERE
      p.purchase_time IS NOT NULL
      AND p.purchase_time >= c.click_time
      AND unix_timestamp(p.purchase_time) < unix_timestamp(c
.click_time) + 24 * 3600)  -- max duration is 24 hours
    GROUP BY
      c_info
  ) first_purchases
) last_clicks
GROUP BY
  SPLIT(last_c_info, ',')[6];
I’ll explain only main point because this article is not to share ad system. The technique is the following.
CONCAT_WS(
  ',',
  click_time,
  click_id,
  audience_id,
  publisher_id,
  advertiser_id,
  ad_id,
  ad_type
) c_info

CONCAT_WS(
  ',',
  purchase_time,
  purchase_id,
  audience_id,
  advertiser_id,
  product_id
) p_info
If you make a group of columns with CONCAT_WS, you can use columns not GROUP BY key easily with SPLIT. MIN/MAX works correctly if you put group by key to the first. Of course, this technique has some demerits.
  • Data type becomes string after CONCAT_WS. So, you may need to do 0 padding for integer column.
  • You need to care secondary group by key every time.
  • You need to pick up a column with integer index from a grouped column. So, if you make a mistake with number, it’s a little difficult to do debug.
Another ideas except this technique are another JOIN and just bringing columns from deepest sub query. But, 1st one makes a long and complex query. If you need more columns from original table, you need to change many parts.
zuqqhi2

Share
Published by
zuqqhi2
Tags: hive