Hive 窗口函数[译]

窗口函数允许你对一个数据集创建一个”窗口”,围绕它进行聚合操作.自Hive 0.11之后引入窗口函数.在本文中,会给各位演示.

Hive中包含的窗口函数

Lead
1.Lead的行数可选,若没有指定则为1行.
2.如果当前行的下一行超出窗口结尾边界则返回Null.

Lag
2.Lag的函数可选,若没有指定则为1行.
3.如果当前行的前一行已经超出窗口开始的边界则返回NULL.

FIRST_VALUE

LAST_VALUE

可用于Over字句的聚合函数

COUNT,SUM,MIN,MAX,AVG

分析函数
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

我们将用股票市场数据进行演示,样本数据可以点此下载.

创建表格并加载数据.

1
create table stocks (date_ String, Ticker String, Open Double, High Double, Low Double, Close Double, Volume_for_the_day int) row format delimited fields terminated by ',';

creating_table

接下来让我们进一步了解Hive的窗口函数.

Lag

该函数返回指定列前一行的值.你可以用一个整数来指定想要获取行的偏移位置,默认为1.

例子如下:

1
select ticker,date_,close,lag(close,1) over(partition by ticker) as yesterday_price from acadgild.stocks

这里使用lag显示了昨日的收盘价,lag域over函数组合使用,在over函数中你可以使用partition或order .
下面的截图中你可以看到每行中今天的和昨天的收盘价.

lag

Lead

该函数返回指定列后一行的值.你可以用一个整数来指定想要获取行的偏移位置,默认为1.

这里用Lead来比较当天的收盘价与下一天的收盘价.

1
select ticker,date_,close,case(lead(close,1) over(partition by ticker)-close)>0 when true then "higher" when false then "lesser" end as Changes from acadgild.stocks

lead

FIRST_VALUE

返回窗口中的第一行. 示例如下.

1
select ticker,first_value(high) over(partition by ticker) as first_high from acadgild.stocks

first_value

LAST_VALUE

与FIRST_VALUE相反,它返回窗口最后一行.演示如下:

1
select ticker,last_value(high) over(partition by ticker) as first_high from acadgild.stocks

last_value

Count

根据窗口(over子句)范围进行计数.

1
select ticker,count(ticker) over(partition by ticker) as cnt from acadgild.stocks

每个partition根据ticker进行计数

count

Sum

窗口(over子句)中的值进行总计计算,下面的查询对ticker进行分区,然后汇总每隔分区的收盘价.

1
select ticker,sum(close) over(partition by ticker) as total from acadgild.stocks

sum

如果你要对volume_for_the_day进行汇总统计,则可以使用以下查询

1
select ticker,date_,volume_for_the_day,sum(volume_for_the_day) over(partition by ticker order by date_) as running_total from acadgild.stocks

running_total

显示每行的比例

如果我们想知道volume_for_the_day在总volume的百分比要怎么做呢:

1
select ticker,date_,volume_for_the_day,(volume_for_the_day*100/(sum(volume_for_the_day) over(partition by ticker))) from acadgild.stocks

percentage

Min

返回窗口(over子句)中最小的数值.

1
select ticker, min(close) over(partition by ticker) as minimum from acadgild.stocks

min

Max

返回窗口(over子句)中最大的数值.

1
select ticker, max(close) over(partition by ticker) as maximum from acadgild.stocks

max

AVG

返回窗口(over子句)中的平均值.

1
select ticker, avg(close) over(partition by ticker) as maximum from acadgild.stocks

avg

Rank

根据窗口的over子句,进行数据排序,如果两个值相同则获得相同的序号,而接下来一个值(序号)会直接过滤. 原文中的示例不好,所以替换掉了…

1
select a1,a2,rank () over (order by a1) rank from test;
A1 A2 RANK
---------- ---------- ----------
1 3 1
2 4 2
3 2 3
3 5 3
4 2 5

Row_number

根据窗口的over子句,进行数据排序,连续的序号,不管是否有相同值

1
select ticker,close,row_number() over(partition by ticker order by close) as num from acadgild.stocks

row_number

Dense_rank

与rank类似但是不会跳过值,原文示例不好,所以替换了..

1
select a1,a2, dense_rank () over (order by a1) rank from test;
A1 A2 RANK
---------- ---------- ----------
1 3 1
2 4 2
3 2 3
3 5 3
4 2 4

Cume_dist

返回累计分布(cumulative distribution),返回值范围为0~1.比如一共有10个值,那么第一个的值为1/10,第二个为2/10 …第十个10/10.

1
select ticker,cume_dist() over(partition by ticker order by close) as cummulative from acadgild.stocks

cume_dist

Percent_rank

返回百分比排序.计算规则为 (序列-1)/(总行数 – 1). 如果返回结果只有一行的话,那么percent_rank的结果就是0

1
select ticker,close,percent_rank() over(partition by ticker order by close) as closing from acadgild.stocks

percent_rank

Ntile

对返回的数据进行分桶(bucket), 如果你指定Ntile(5)那么它会创建5个桶.然后前20%的记录分到第一个桶.以此类推最后20%分到第5个.

1
select ticker,ntile(5) over(partition by ticker order by close ) as bucket from acadgild.stocks

ntile

原文链接

Windowing Functions In Hive

打赏支持:支付宝/微信,感谢赏口饭吃