This website requires JavaScript.

Hive 窗口函数[译]

窗口函数允许你对一个数据集创建一个"窗口",围绕它进行聚合操作.自Hive 0.11之后引入窗口函数.在本文中,会给各位演示.

Hive中包含的窗口函数

Lead 1.Lead的行数可选,若没有指定则为1行. 2.如果当前行的下一行超出窗口结尾边界则返回Null.

Lag 2.Lag的函数可选,若没有指定则为1行. 3.如果当前行的前一行已经超出窗口开始的边界则返回NULL.

FIRST_VALUE

LAST_VALUE

可用于Over字句的聚合函数

COUNT,SUM,MIN,MAX,AVG

分析函数 RANK ROW_NUMBER DENSE_RANK CUME_DIST PERCENT_RANK NTILE

我们将用股票市场数据进行演示,样本数据可以点此下载.

创建表格并加载数据.

create table stocks (date_ String, Ticker String, Open Double, High Double, Low Double, Close Double, Volume_for_the_day int) row format delimited fields terminated by ',';

creating_table

接下来让我们进一步了解Hive的窗口函数.

Lag

该函数返回指定列前一行的值.你可以用一个整数来指定想要获取行的偏移位置,默认为1.

例子如下:

select ticker,date_,close,lag(close,1) over(partition by ticker) as yesterday_price from acadgild.stocks

这里使用lag显示了昨日的收盘价,lag域over函数组合使用,在over函数中你可以使用partition或order . 下面的截图中你可以看到每行中今天的和昨天的收盘价.

lag

Lead

该函数返回指定列后一行的值.你可以用一个整数来指定想要获取行的偏移位置,默认为1.

这里用Lead来比较当天的收盘价与下一天的收盘价.

select ticker,date_,close,case(lead(close,1) over(partition by ticker)-close)>0 when true then "higher" when false then "lesser" end as Changes from acadgild.stocks

lead

FIRST_VALUE

返回窗口中的第一行. 示例如下.

select ticker,first_value(high) over(partition by ticker) as first_high from acadgild.stocks

first_value

LAST_VALUE

与FIRST_VALUE相反,它返回窗口最后一行.演示如下:

select ticker,last_value(high) over(partition by ticker) as first_high from acadgild.stocks

last_value

Count

根据窗口(over子句)范围进行计数.

select ticker,count(ticker) over(partition by ticker) as cnt from acadgild.stocks

每个partition根据ticker进行计数

count

Sum

窗口(over子句)中的值进行总计计算,下面的查询对ticker进行分区,然后汇总每隔分区的收盘价.

select ticker,sum(close) over(partition by ticker) as total from acadgild.stocks

sum

如果你要对volume_for_the_day进行汇总统计,则可以使用以下查询

select ticker,date_,volume_for_the_day,sum(volume_for_the_day) over(partition by ticker order by date_) as running_total from acadgild.stocks

running_total

显示每行的比例

如果我们想知道volume_for_the_day在总volume的百分比要怎么做呢:

select ticker,date_,volume_for_the_day,(volume_for_the_day*100/(sum(volume_for_the_day) over(partition by ticker))) from acadgild.stocks

percentage

Min

返回窗口(over子句)中最小的数值.

select ticker, min(close) over(partition by ticker) as minimum from acadgild.stocks

min

Max

返回窗口(over子句)中最大的数值.

select ticker, max(close) over(partition by ticker) as maximum from acadgild.stocks

max

AVG

返回窗口(over子句)中的平均值.

select ticker, avg(close) over(partition by ticker) as maximum from acadgild.stocks

avg

Rank

根据窗口的over子句,进行数据排序,如果两个值相同则获得相同的序号,而接下来一个值(序号)会直接过滤. 原文中的示例不好,所以替换掉了...

 select a1,a2,rank () over (order by a1) rank from test;
A1 A2 RANK
---------- ---------- ----------
1 3 1
2 4 2
3 2 3
3 5 3
4 2 5

Row_number

根据窗口的over子句,进行数据排序,连续的序号,不管是否有相同值

select ticker,close,row_number() over(partition by ticker order by close) as num from acadgild.stocks

row_number

Dense_rank

与rank类似但是不会跳过值,原文示例不好,所以替换了..

select a1,a2, dense_rank () over (order by a1) rank from test;
A1 A2 RANK
---------- ---------- ----------
1 3 1
2 4 2
3 2 3
3 5 3
4 2 4

Cume_dist

返回累计分布(cumulative distribution),返回值范围为0~1.比如一共有10个值,那么第一个的值为1/10,第二个为2/10 ...第十个10/10.

select ticker,cume_dist() over(partition by ticker order by close) as cummulative from acadgild.stocks

cume_dist

Percent_rank

返回百分比排序.计算规则为 (序列-1)/(总行数 – 1). 如果返回结果只有一行的话,那么percent_rank的结果就是0

select ticker,close,percent_rank() over(partition by ticker order by close) as closing from acadgild.stocks

percent_rank

Ntile

对返回的数据进行分桶(bucket), 如果你指定Ntile(5)那么它会创建5个桶.然后前20%的记录分到第一个桶.以此类推最后20%分到第5个.

select ticker,ntile(5) over(partition by ticker order by close ) as bucket from acadgild.stocks

ntile

原文链接

Windowing Functions In Hive

0条评论
avatar