hivesql练习_会话划分问题

发布时间 2023-04-02 17:50:00作者: 娜娜娜娜小姐姐

现有页面浏览记录表(page_view_events)如下,表中有每个用户的每次页面访问记录。

user_idpage_idview_timestamp
100 home 1659950435
100 good_search 1659950446
100 good_list 1659950457
100 home 1659950541
100 good_detail 1659950552
100 cart 1659950563
101 home 1659950435
101 good_search 1659950446
101 good_list 1659950457
101 home 1659950541
101 good_detail 1659950552
101 cart 1659950563
102 home 1659950435
102 good_search 1659950446
102 good_list 1659950457
103 home 1659950541
103 good_detail 1659950552
103 cart 1659950563

规定若同一用户的相邻两次访问记录时间间隔小于60s,则认为两次浏览记录属于同一会话。现有如下需求,为属于同一会话的访问记录增加一个相同的会话id字段,会话id格式为"user_id-number",其中number从1开始,用于区分同一用户的不同会话,期望结果如下:

user_id
<int>
(用户id)
page_id
<string>
(页面id)
view_timestamp
<bigint>
(浏览时间戳)
session_id
<string>
(会话id)
100 home 1659950435 100-1
100 good_search 1659950446 100-1
100 good_list 1659950457 100-1
100 home 1659950541 100-2
100 good_detail 1659950552 100-2
100 cart 1659950563 100-2
101 home 1659950435 101-1
101 good_search 1659950446 101-1
101 good_list 1659950457 101-1
101 home 1659950541 101-2
101 good_detail 1659950552 101-2
101 cart 1659950563 101-2
102 home 1659950435 102-1
102 good_search 1659950446 102-1
102 good_list 1659950457 102-1
103 home 1659950541 103-1
103 good_detail 1659950552 103-1

 

解答:

 1 with tt as(
 2   SELECT
 3   user_id,page_id,view_timestamp,
 4   lag(view_timestamp, 1, 0) over(partition by user_id order by view_timestamp) last_view_timestamp -- 获取上次一访问页面时间,默认值设为0
 5   FROM
 6   page_view_events
 7 )
 8 select
 9 user_id, page_id, view_timestamp,
10 concat(user_id, "-", sum(if(view_timestamp - last_view_timestamp > 60, 1, 0)) over(partition by user_id order by view_timestamp)) session_id -- 一次新会话,相邻两次访问时间大于60s。统计截至到当前时间共有几次会话
11 from tt