Hive function — windowing function 1

Time:2021-7-24

Windowing function

First acquaintance

Window: the range of datasets that the function evaluates at run time
Function: running function!

Only the following functions are supported:

Windowing functions:

LEAD:

LEAD (scalar_ Expression [, offset] [, default]): returns the column value of the specified column of the following n rows of the current row! If not found, the default value is used

LAG:

LAG (scalar_ Expression [, offset] [, default]): returns the column value of the specified column of N rows above the current row! If not found, the default value is used

FIRST_VALUE:

FIRST_ Value (column name, [false (default)]): returns the first value of the specified column in the current window. If the second parameter is true, it means that the first value added is null. Skip the null value and continue searching!

LAST_VALUE:

LAST_ Value (column name, [false (default)]): returns the last value of the specified column in the current window. If the second parameter is true, it means that the first value added is null. Skip the null value and continue searching!

Functions of statistics class (generally used in combination with over):

min,max,avg,sum,count

Ranking analysis:

RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

be careful:

Not all functions can control the range of calculated data sets by changing the size of the window!
All ranking functions, lag and lead support the use of over(), but window cannot be defined in over()_ clause

Format:

Function over (partition by field, order by field, window_ clause )
The size of the window can be through windows_ Clause to specify:

(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between start range and end range

(rows | range) between current row and (current row | (unbounded | [num]) following)

(rows | range) between [num] following and (unbounded | [num]) following
exceptional case:
  • Windows does not appear in over()_ Claim, and no order by appears. The default window is rows between unbounded forecasting and unbounded following
  • Windows does not appear in over()_ Claim), specify order by, and the window defaults to rows between unbounded forecasting and current row
What is the difference between window functions and grouping?
  1. If it is a grouping operation, only the grouped fields can be written after select
  2. If it is a window function, the window function executes the function once for each record in the specified window
  3. If it is a grouping operation, it has the effect of de duplication, but the partition does not de duplication!

Windowing function format: function name (column) over (option)

The over keyword indicates that the function is treated as a windowing function rather than an aggregate function.
Options are often added in parentheses after the over keyword to change the scope of the window for aggregation.

insert overwrite table `...` partition(platform_id)
select
t.user_id,...,t.platform_id from (select u.*,row_number() over (partition by u.user_id order by u.updated_at desc) as rn from `...` u) t where t.rn=1;

Partition by clause:

After the over keyword of the windowing function, the partition by clause can be used to define the partition of the row for aggregation calculation. Unlike the group by clause, the partition created by the partition by clause is independent of the result set. The created partition is only for aggregation calculation, and the partitions created by different windowing functions do not affect each other.

Multiple windowing functions can be used simultaneously in the same select statement, and these windowing functions will not interfere with each other
Disturbing.

--Display the information of each person, the number of people in the city and the number of peers:
select fname,
       fcity,
       fage,
       fsalary,
       Count (*) over (partition by fcity) the number of people in the city,
       Count (*) over (partition by Fage) number of peers
  from t_person

Order by clause:

Order by field name range|rows between boundary rule 1 and boundary rule 2

Example 1:

select fname,
       fcity,
       fage,
       fsalary,
       Sum (fsalary) over (order by fsalary rows between unbounded preceding and current row)
  from t_person

The windowing function “sum (fsalary) over (order by fsalary rows between unbounded prediction and current row)” here means sorting according to fsalary, and then calculating the sum from the first row (unbounded prediction) to the current row (current row). Such calculation result is the cumulative sum of salary values sorted according to salary.

Example 2:

select fname,
       fcity,
       fage,
       fsalary,
       Sum (fsalary) over (order by fsalary range between unbounded preceding and current row)
  from t_person

The only difference between the SQL statement in example 2 and that in example 1 is that “rows” is replaced by “range”“ “Rows” is positioned according to the number of rows, while “range” is positioned according to the value range. These two different positioning methods are mainly used to deal with parallel sorting. For example, the salary of lily, swing and bill is 2000 yuan
If “rows” is used to locate the range, the cumulative sum from the first item to the current line will be calculated. If “range” is used to locate the range, the cumulative sum from the first item to the current line will still be calculated. However, since there are three people with a salary Equal to 2000 yuan, the cumulative sum calculated ends with those with a salary from the first item to 2000 yuan, so for Lily, swing Bill, all three people get 7000 (“1000 + 2000 + 2000 + 2000”) when calculating the aggregation of windowing functions.

row_ Number () over (partition by grouping column order by   Sort column (DESC)

row_ Number () over has the function of grouping and sorting

Example 3:

SELECT FName, FSalary,
Sum (fsalary) over (order by fsalary rows between 1 following and 3 following)
FROM T_Person;

The windowing function “sum (fsalary) over (order by fsalary rows between 1 following and 3 following)” here means sorting according to fsalary, and then calculating the salary and from the next row (1 following) to the last three rows (3 following) of the current row. Note that the last line has no subsequent lines, and its calculation result is
Null value is null instead of 0.

SELECT FName, FSalary,FAge,
Max (fsalary) over (order by Fage) the maximum value before this line
FROM T_Person;

The windowing function “max (fsalary) over (order by Fage)” here is a simplified expression of “max (fsalary) over (order by Fage range between unbounded forecasting and current row)”. It means sorting according to fsalary, and then calculating the maximum salary value of personnel from the first row (unbounded forecasting) to the current row (current row).

Tips

  • Count() after grouping, count all the data in a group!

reference:
https://www.cnblogs.com/lihaoyang/p/6756956.html

Recommended Today

The selector returned by ngrx store createselector performs one-step debugging of fetching logic

Test source code: import { Component } from ‘@angular/core’; import { createSelector } from ‘@ngrx/store’; export interface State { counter1: number; counter2: number; } export const selectCounter1 = (state: State) => state.counter1; export const selectCounter2 = (state: State) => state.counter2; export const selectTotal = createSelector( selectCounter1, selectCounter2, (counter1, counter2) => counter1 + counter2 ); // […]