Hive calculates the maximum number of consecutive login days

Time:2021-11-26

1、 Background
In the business requirements of website platform, the demand for “maximum login days” of users is relatively common.

raw data:

u0001 2019-10-10
u0001 2019-10-11
u0001 2019-10-12
u0001 2019-10-14
u0001 2019-10-15
u0001 2019-10-17
u0001 2019-10-18
u0001 2019-10-19
u0001 2019-10-20
u0002 2019-10-20

Note: the data is a simplified version, and the two columns are user_ id,log_ in_ date。 In reality, it is necessary to get the above data by de duplication and conversion from the collected data.

Let’s create a table and import the data into hive:

create table test.user_log_1 (user_id string, log_in_date string) row format delimited fields terminated by ‘ ‘;

load data local inpath ‘/var/lib/hadoop-hdfs/data/user_log.txt’ into table test.user_log_1 ;

Check the data:

hive> select * from test.user_log_1 ;
OK
u0001 2019-10-10
u0001 2019-10-11
u0001 2019-10-12
u0001 2019-10-14
u0001 2019-10-15
u0001 2019-10-17
u0001 2019-10-18
u0001 2019-10-19
u0001 2019-10-20
u0002 2019-10-20
Time taken: 0.076 seconds, Fetched: 10 row(s)

2、 Algorithm
The core is to sort by access time. The login time column subtracts the sorted serial number to get a date value, which can be grouped and counted.Big data training

  1. Step 1: sort
    By user_ ID and log by date_ in_ Date sort:

select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1;

result:

u0001 2019-10-10 1
u0001 2019-10-11 2
u0001 2019-10-12 3
u0001 2019-10-14 4
u0001 2019-10-15 5
u0001 2019-10-17 6
u0001 2019-10-18 7
u0001 2019-10-19 8
u0001 2019-10-20 9

u0002 2019-10-20 1

It can be seen here that the maximum number of consecutive login days of u0001 is 4 days, which can be verified after using the following calculation method.

  1. Step 2: make the date difference between the second column and the third column
    It can be seen that if the date is small, the line number is also small; If you make a difference between the date and the line number, the difference of continuous login should be the same.

select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1)m;

result:

u0001 2019-10-09
u0001 2019-10-09
u0001 2019-10-09
u0001 2019-10-10
u0001 2019-10-10
u0001 2019-10-11
u0001 2019-10-11
u0001 2019-10-11
u0001 2019-10-11
u0002 2019-10-19

Obviously, the maximum number of consecutive logins is 4.

  1. Step 3: group and sum according to the second column
    select user_id, dts, count(1) num from (select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1)m)m2 group by user_id, dts;

result:

u0001 2019-10-09 3
u0001 2019-10-10 2
u0001 2019-10-11 4
u0002 2019-10-19 1

  1. Step 4: find the maximum number of times
    The sequence of consecutive login days of each user has been calculated, and then take the maximum login days of each user:

select user_id, max(num) from (select user_id, dts, count(1) num from (select user_id, date_sub(log_in_date, rank) dts from (select user_id, log_in_date, row_number() over(partition by user_id order by log_in_date) as rank from test.user_log_1)m)m2 group by user_id, dts)m3 group by user_id;

The result is consistent with our expectation. The maximum login days of user u0001 is 4.

u0001 4
u0002 1

3、 Extension (maximum trading days of stocks)
We know that the stock market, such as our A shares, does not open at the weekend. If a stock rises the limit last Friday and then rises the limit this Monday, it can be regarded as two consecutive days. It is impossible to use the above method. Try using the lead function:

select user_id, log_in_date, lead(log_in_date) over(partition by user_id order by log_in_date) end_date from test.user_log_1;

result

u0001 2019-10-10 2019-10-11
u0001 2019-10-11 2019-10-12
u0001 2019-10-12 2019-10-14
u0001 2019-10-14 2019-10-15
u0001 2019-10-15 2019-10-17
u0001 2019-10-17 2019-10-18
u0001 2019-10-18 2019-10-19
u0001 2019-10-19 2019-10-20
u0001 2019-10-20 NULL
u0002 2019-10-20 NULL

Ha ha, do you have a train of thought.

Idea: there are three columns in the above results. The first column is uid. Through the lead function, the next two columns are dates, so the dates in both columns are taken from Monday to Friday, that is, there are only working days in the data, and there is no weekend data. You can filter the data in advance to meet the requirements. Since it needs to be continuous, then:

If the date in the third column minus the date in the second column is equal to 1, it is obviously continuous;
If the date in the third column subtracts the date in the second column, the difference is equal to 3, but the date in the third column is Monday, it is also continuous;
Combining the above two conditions, you can calculate the maximum consecutive trading days of the stock. Have you learned to waste it.