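Advanced grouping aggregation in Hive refers to grouped aggregation that uses GROUPING SETS, CUBE, or ROLLUP.
These constructs appear in many database SQL dialects and are not unique to Hive; this article covers only how they behave in Hive.
Using them can simplify SQL statements and, in many cases, improve their performance.
Example:
-- Usage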
select a,b,sum(c) from tbl group by a,b grouping sets(a,b)
The GROUPING SETS clause lets a single GROUP BY statement specify several sets of grouping columns. Any statement containing GROUPING SETS can be expressed as several GROUP BY queries combined with UNION ALL.
Some common equivalent rewrites:
-- Statement 1
select a,b,sum(c) from tbl group by a,b grouping sets((a,b))
-- is equivalent to
select a,b,sum(c) from tbl group by a,b
-- Statement 2
select a,b,sum(c) from tbl group by a,b grouping sets((a,b),a)
-- is equivalent to
select a,b,sum(c) from tbl group by a,b
union all
select a,null,sum(c) from tbl group by a
-- Statement 3
select a,b,sum(c) from tbl group by a,b grouping sets(a,b)
-- is equivalent to
select a,null,sum(c) from tbl group by a
union all
select null,b,sum(c) from tbl group by b
-- Statement 4
select a,b,sum(c) from tbl group by a,b grouping sets((a,b),a,b,())
-- is equivalent to
select a,b,sum(c) from tbl group by a,b
union all
select a,null,sum(c) from tbl group by a
union all
select null,b,sum(c) from tbl group by b
union all
select null,null,sum(c) from tbl
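As these rewrites show, the GROUPING SETS form is considerably more concise than the equivalent multi-query form; performance is analyzed later.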
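The equivalence can be checked on a toy dataset. The following Python sketch is only an illustration: the sample rows and the helper names `mask`, `group_by`, and `grouping_sets` are made up for this purpose and say nothing about how Hive implements the clause. It evaluates statement 4 in a single pass over the rows and compares the result with the concatenation of four separate group-bys.

```python
from collections import defaultdict

# Hypothetical sample rows for tbl(a, b, c); not taken from the article.
ROWS = [("x", "p", 1), ("x", "q", 2), ("y", "p", 3)]
COLS = ("a", "b")


def mask(row, keep):
    """Keep the grouping columns named in `keep`; replace the others with None."""
    return tuple(v if name in keep else None for name, v in zip(COLS, row))


def group_by(rows, keep):
    """A plain GROUP BY on the columns in `keep`, summing c."""
    totals = defaultdict(int)
    for a, b, c in rows:
        totals[mask((a, b), keep)] += c
    return [key + (total,) for key, total in totals.items()]


def grouping_sets(rows, sets):
    """One pass over the rows: each row contributes to every grouping set."""
    totals = defaultdict(int)
    for a, b, c in rows:
        for set_id, keep in enumerate(sets):
            # set_id keeps groups from different sets apart even if their keys collide
            totals[(set_id,) + mask((a, b), keep)] += c
    return [key[1:] + (total,) for key, total in totals.items()]


# grouping sets((a,b),a,b,()) from statement 4
SETS = [("a", "b"), ("a",), ("b",), ()]
one_pass = sorted(grouping_sets(ROWS, SETS), key=repr)
union_all = sorted((r for keep in SETS for r in group_by(ROWS, keep)), key=repr)
print(one_pass == union_all)  # True
```

The single-pass version reads the rows once, whereas the multi-query version reads them once per grouping set, which is the intuition behind the performance discussion later in the article.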
Example:
-- cube usage example
select a,b,c,count(1) from tbl group by a,b,c with cube
-- rollup usage example
select a,b,c,count(1) from tbl group by a,b,c with rollup
Usage notes:
Both constructs perform several grouped aggregations within one GROUP BY statement, and both can be rewritten equivalently with GROUPING SETS.
-- cube statement
select a,b,c,count(1) from tbl group by a,b,c with cube
-- is equivalent to
select a,b,c,count(1) from tbl group by a,b,c
grouping sets((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())
-- rollup statement (hierarchical roll-up aggregation)
select a,b,c,count(1) from tbl group by a,b,c with rollup
-- is equivalent to
select a,b,c,count(1) from tbl group by a,b,c
grouping sets((a,b,c),(a,b),(a),())
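The two expansions follow a simple rule: CUBE enumerates every subset of the grouping columns, and ROLLUP enumerates every prefix. A minimal Python sketch of that rule is given below; the function names `cube_sets`, `rollup_sets`, and `to_clause` are illustrative helpers, not part of Hive.

```python
from itertools import combinations


def cube_sets(cols):
    """All 2**n subsets of cols, largest first (the sets WITH CUBE expands to)."""
    return [combo
            for size in range(len(cols), -1, -1)
            for combo in combinations(cols, size)]


def rollup_sets(cols):
    """The n+1 prefixes of cols, longest first (the sets WITH ROLLUP expands to)."""
    return [tuple(cols[:size]) for size in range(len(cols), -1, -1)]


def to_clause(sets):
    """Render a list of grouping sets as a GROUPING SETS clause."""
    return "grouping sets(" + ",".join("(" + ",".join(s) + ")" for s in sets) + ")"


print(to_clause(rollup_sets(["a", "b", "c"])))
# grouping sets((a,b,c),(a,b),(a),())
print(len(cube_sets(["a", "b", "c"])))
# 8
```

For n grouping columns, CUBE therefore yields 2^n grouping sets and ROLLUP yields n + 1, which matters for the data-expansion concern discussed later.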
We can use the execution plan to analyze how a SQL statement with advanced grouping aggregation is executed and to see where it is optimized.
Example 1: executing SQL that contains the GROUPING SETS keyword.
set hive.map.aggr=true;
explain
-- Average age by gender for people under 30
select gender,avg(age) as avg_age from temp.user_info_all where ymd = '20230505'
and age < 30
group by gender;
-- The same statement rewritten with the GROUPING SETS keyword
set hive.map.aggr=true;
explain
select gender,avg(age) as avg_age from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender grouping sets((gender));
Its execution plan:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: user_info_all
Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (age < 30) (type: boolean)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: avg(age)
keys: gender (type: int), 0 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: int)
sort order: ++
Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: struct<count:bigint,sum:double,input:bigint>)
Reduce Operator Tree:
Group By Operator
aggregations: avg(VALUE._col0)
keys: KEY._col0 (type: int), KEY._col1 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col2
Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
pruneGroupingSetId: true
Select Operator
expressions: _col0 (type: int), _col2 (type: double)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
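Reading the key parts of the plan above:
Map stage: TableScan reads user_info_all; the Filter Operator applies age < 30; the Group By Operator pre-aggregates avg(age) in hash mode, keyed by gender plus a constant 0 that appears to act as the grouping-set identifier; the Reduce Output Operator sorts and partitions on both keys.
Reduce stage: the Group By Operator merges the partial aggregates in mergepartial mode; pruneGroupingSetId: true indicates that the grouping-set identifier column is dropped; the Select Operator and File Output Operator then emit gender and the average.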
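The plan shows that, for SQL containing GROUPING SETS, Hive's explain output gives no specific detail on how the grouping sets are implemented.
Let's run an example with more than one grouping column:
Example 2: a combined test that aggregates by age and by gender.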
set hive.map.aggr=true;
explain
select gender,age,count(0) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender,age grouping sets(gender,age);
Note: every column used in GROUPING SETS must already be declared in the preceding GROUP BY.
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: user_info_all
Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (age < 30) (type: boolean)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(0)
keys: gender (type: int), age (type: bigint), 0 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
sort order: +++
Map-reduce partition columns: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int), KEY._col1 (type: bigint), KEY._col2 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col3
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
pruneGroupingSetId: true
Select Operator
expressions: _col0 (type: int), _col1 (type: bigint), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
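These two examples show that Hive's execution plan does not spell out how the individual grouping sets are carried out; both statements execute in much the same way.
Compared with a multi-query rewrite, the table is scanned once rather than several times, which reduces data I/O and saves some compute resources.
Example 3: using CUBE in place of GROUPING SETS.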
set hive.map.aggr=true;
explain
select gender,age,count(0) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender,age with cube;
-- Equivalent statement
select gender,age,count(0) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender,age grouping sets((gender,age),(gender),(age),());
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: user_info_all
Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (age < 30) (type: boolean)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(0)
keys: gender (type: int), age (type: bigint), 0 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 43512392 Data size: 1044297408 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
sort order: +++
Map-reduce partition columns: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
Statistics: Num rows: 43512392 Data size: 1044297408 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int), KEY._col1 (type: bigint), KEY._col2 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col3
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
pruneGroupingSetId: true
Select Operator
expressions: _col0 (type: int), _col1 (type: bigint), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
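The CUBE statement in Example 3 returns quite different data from the statement in Example 2, yet its execution plan is almost identical to Example 2's apart from the row estimates. Hive's explain output evidently does not yet break out how an advanced grouping aggregation is split up.
When using advanced grouping aggregation, make sure map-side aggregation is enabled (hive.map.aggr=true).
As the cases above show, advanced grouping aggregation needs only one job to do what the UNION-style rewrite would need several jobs for.
From this point of view it reduces the disk and network I/O overhead of running multiple jobs, which is an optimization.
At the same time, watch out for the rapid data expansion that overusing advanced grouping aggregation can cause.
With a simple GROUP BY, each input row contributes to exactly one aggregation, and each group normally produces one record.
With advanced grouping aggregation such as CUBE, each input row contributes to several aggregations within the same job, and the final output contains one record per group for each of those aggregations.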
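The expansion is visible in the row estimates of the three plans above. The map-side Group By Operator estimate equals the Filter Operator estimate multiplied by the number of grouping sets. The Python check below copies the numbers from the plans; the interpretation that each input row is emitted once per grouping set is an inference from those numbers, not something the plan states explicitly.

```python
# Estimated row counts copied from the three explain plans above.
FILTERED_ROWS = 10878098  # Filter Operator, identical in all three plans
MAP_GROUP_BY_ROWS = {
    "example 1": 10878098,  # grouping sets((gender))
    "example 2": 21756196,  # grouping sets(gender,age)
    "example 3": 43512392,  # with cube on (gender,age)
}
GROUPING_SETS = {"example 1": 1, "example 2": 2, "example 3": 4}


def expanded_rows(input_rows, num_sets):
    """Rows entering map-side aggregation if each row is emitted once per set."""
    return input_rows * num_sets


for name, num_sets in GROUPING_SETS.items():
    matches = expanded_rows(FILTERED_ROWS, num_sets) == MAP_GROUP_BY_ROWS[name]
    print(name, num_sets, matches)
# example 1 1 True
# example 2 2 True
# example 3 4 True
```

Because CUBE over n columns has 2^n grouping sets, the same reasoning suggests that the intermediate volume grows exponentially with the number of cubed columns.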
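Notes:
When a statement with advanced grouping aggregation runs against a very large base table, Map or Reduce tasks can easily fail for lack of hardware resources.
Hive provides the hive.new.job.grouping.set.cardinality setting to deal with this situation.
If the number of grouping sets in the SQL statement exceeds the value of this setting (default 30), Hive creates an additional job.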
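To get a feel for when the default threshold is crossed, the rule can be modelled in a few lines of Python. This is a simplified sketch of the rule as described here: the function names are hypothetical, and the real planner may take further conditions into account.

```python
DEFAULT_THRESHOLD = 30  # default of hive.new.job.grouping.set.cardinality, as stated above


def cube_set_count(num_columns):
    """CUBE over n columns expands to 2**n grouping sets."""
    return 2 ** num_columns


def rollup_set_count(num_columns):
    """ROLLUP over n columns expands to n + 1 grouping sets."""
    return num_columns + 1


def needs_extra_job(num_grouping_sets, threshold=DEFAULT_THRESHOLD):
    """Simplified model: an extra job is added when the set count exceeds the threshold."""
    return num_grouping_sets > threshold


print(needs_extra_job(cube_set_count(4)))     # False (16 sets)
print(needs_extra_job(cube_set_count(5)))     # True (32 sets)
print(needs_extra_job(rollup_set_count(10)))  # False (11 sets)
```

Under this model, a CUBE over five or more columns already exceeds the default, while ROLLUP rarely does.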