Advanced grouping aggregation in Hive: what it is, how to use it, caveats, and performance analysis

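Advanced grouping aggregation in Hive means aggregating with GROUPING SETS, CUBE, or ROLLUP in the group by clause.

These constructs appear in the SQL dialects of many databases and are not unique to Hive; this article covers only how they behave in Hive.

Using them not only shortens the SQL but, in most cases, also improves its performance.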

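1. Using grouping sets

Example:

-- Usage
select a,b,sum(c) from tbl group by a,b grouping sets(a,b)

A grouping sets clause lets a single group by statement specify several sets of grouping columns. Any query with a grouping sets clause can be expressed as several group by queries joined with union. (Strictly speaking, union all is the closer match, because union also removes duplicate rows.)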

Some common equivalent rewrites:

-- Statement 1
select a,b,sum(c) from tbl group by a,b grouping sets((a,b))
-- is equivalent to
select a,b,sum(c) from tbl group by a,b

-- Statement 2
select a,b,sum(c) from tbl group by a,b grouping sets((a,b),a)
-- is equivalent to
select a,b,sum(c) from tbl group by a,b
union
select a,null,sum(c) from tbl group by a

-- Statement 3
select a,b,sum(c) from tbl group by a,b grouping sets(a,b)
-- is equivalent to
select a,null,sum(c) from tbl group by a
union
select null,b,sum(c) from tbl group by b

-- Statement 4
select a,b,sum(c) from tbl group by a,b grouping sets((a,b),a,b,())
-- is equivalent to
select a,b,sum(c) from tbl group by a,b
union
select a,null,sum(c) from tbl group by a
union
select null,b,sum(c) from tbl group by b
union
select null,null,sum(c) from tbl

As these pairs show, the grouping sets form is considerably more concise than its union equivalent. Performance is analysed in section 3.
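One practical consequence of the null placeholders in these rewrites: a row whose b is null may be a subtotal from the (a) grouping set, or it may come from real null values in b. Hive exposes a GROUPING__ID virtual column for telling the two apart. The following is a minimal sketch, assuming the same illustrative table tbl(a, b, c) used above. The numeric encoding of GROUPING__ID differs between Hive versions, so inspect the values on your own version before relying on specific numbers.

```sql
-- Sketch only: tbl(a, b, c) is the illustrative table used in this section.
-- grouping__id identifies which grouping set produced each output row,
-- so subtotal rows can be told apart from rows whose b is genuinely null.
select a, b, sum(c) as total, grouping__id
from tbl
group by a, b grouping sets((a,b),a,b,())
```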

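2. Using cube and rollup

Example:

-- cube example
select a,b,c,count(1) from tbl group by a,b,c with cube
-- rollup example
select a,b,c,count(1) from tbl group by a,b,c with rollup

Usage notes:

Both clauses perform several grouping aggregations within a single group by statement, and both can be rewritten as an equivalent grouping sets clause.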

cube computes every combination of the group by columns:

-- cube statement
select a,b,c,count(1) from tbl group by a,b,c with cube
-- is equivalent to
select a,b,c,count(1) from tbl group by a,b,c
grouping sets((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())

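rollup aggregates along the group by columns from left to right, dropping the rightmost column at each level:

-- rollup statement (hierarchical roll-up aggregation)
select a,b,c,count(1) from tbl group by a,b,c with rollup
-- is equivalent to
select a,b,c,count(1) from tbl group by a,b,c
grouping sets((a,b,c),(a,b),(a),())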

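3. Performance analysis of advanced grouping aggregation

We can read the execution plan to see how a SQL statement with advanced grouping aggregation is executed and where it is optimised.

Example 1: a SQL statement that uses the grouping sets keyword.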

set hive.map.aggr=true;
explain
-- Average age by gender for people under 30
select gender,avg(age) as avg_age from temp.user_info_all where ymd = '20230505'
and age < 30 
group by gender;

-- The same statement rewritten with the grouping sets keyword
set hive.map.aggr=true;
explain
select gender,avg(age) as num from temp.user_info_all 
where ymd = '20230505'
and age < 30 
group by gender grouping sets((gender));

The execution plan of the grouping sets version:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: user_info_all
            Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (age < 30) (type: boolean)
              Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: avg(age)
                keys: gender (type: int), 0 (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int), _col1 (type: int)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
                  Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: struct<count:bigint,sum:double,input:bigint>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: avg(VALUE._col0)
          keys: KEY._col0 (type: int), KEY._col1 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col2
          Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
          pruneGroupingSetId: true
          Select Operator
            expressions: _col0 (type: int), _col2 (type: double)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: true
              Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

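Reading the key fields of this plan:

Map phase:

Group By Operator: aggregation is performed on the map side.
aggregations: the aggregate function, avg(age) in this example.
keys: the grouping column plus a constant column 0, which carries the grouping set ID (see pruneGroupingSetId in the reduce phase).
mode: hash.
outputColumnNames: the three output columns, _col0, _col1, _col2.
Reduce Output Operator: the step that follows map-side aggregation and emits the map output to the reducers.
key expressions: the keys emitted by the map side, here the two columns gender and 0.
sort order: both key columns are sorted in ascending order (++).
Map-reduce partition columns: the columns used to partition the map output, here gender and 0.
value expressions: the value emitted by the map side, a struct holding the partial aggregate.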

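Reduce phase:

Group By Operator: the grouping aggregation performed in the reduce phase.
aggregations: the aggregate function. avg(VALUE._col0) merges the partial results carried in the map-side value expressions into the final average.
keys: the grouping keys, two columns, taken from the keys emitted by the map phase.
mode: mergepartial.
outputColumnNames: the final output columns, here gender and num.
pruneGroupingSetId: whether the grouping ID is pruned from the final output. When true, the last key column is discarded, which is the 0 column in this example.
Select Operator: column projection.
expressions: the output columns, gender and num.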

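As this plan shows, for a SQL statement containing grouping sets the Hive execution plan does not spell out how the grouping is actually implemented.

Let us try an example with more than one grouping column:

Example 2: grouping by age and by gender in a single statement.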

set hive.map.aggr=true;
explain
select gender,age,count(0) as num from temp.user_info_all 
where ymd = '20230505'
and age < 30 
group by gender,age grouping sets(gender,age);

Note: every column used inside grouping sets must also be declared in the preceding group by.

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: user_info_all
            Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (age < 30) (type: boolean)
              Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(0)
                keys: gender (type: int), age (type: bigint), 0 (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
                  Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: int), KEY._col1 (type: bigint), KEY._col2 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col3
          Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
          pruneGroupingSetId: true
          Select Operator
            expressions: _col0 (type: int), _col1 (type: bigint), _col3 (type: bigint)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: true
              Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

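These two examples show that the Hive execution plan does not reveal how advanced grouping aggregation forms its groups; both statements are executed in essentially the same way.

What the single-job plan does confirm is that the table is scanned only once, which avoids repeated scans and I/O and saves some compute resources.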
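For comparison, the union-style rewrite of example 2 might look like the sketch below. It was not part of the original test run, and the casts on the null placeholders are an assumption made to keep the column types of both branches aligned with the plan above (gender as int, age as bigint). Running explain on it and comparing the result with the grouping sets plan should show whether the table is scanned once per branch.

```sql
-- Sketch: union all equivalent of example 2 (one group by per grouping set).
explain
select gender, cast(null as bigint) as age, count(0) as num
from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender
union all
select cast(null as int) as gender, age, count(0) as num
from temp.user_info_all
where ymd = '20230505'
and age < 30
group by age;
```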

Example 3: using cube instead of grouping sets.

set hive.map.aggr=true;
explain
select gender,age,count(0) as num from temp.user_info_all 
where ymd = '20230505'
and age < 30 
group by gender,age with cube;

-- Equivalent statement
select gender,age,count(0) as num from temp.user_info_all 
where ymd = '20230505'
and age < 30 
group by gender,age grouping sets((gender,age),(gender),(age),());

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: user_info_all
            Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (age < 30) (type: boolean)
              Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(0)
                keys: gender (type: int), age (type: bigint), 0 (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 43512392 Data size: 1044297408 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
                  Statistics: Num rows: 43512392 Data size: 1044297408 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: int), KEY._col1 (type: bigint), KEY._col2 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col3
          Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
          pruneGroupingSetId: true
          Select Operator
            expressions: _col0 (type: int), _col1 (type: bigint), _col3 (type: bigint)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: true
              Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

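The cube statement in example 3 returns entirely different data from example 2, yet its execution plan is almost identical. The one visible difference is the estimated row count of the map-side Group By Operator: 43512392 here, four times the 10878098 filtered rows, against 21756196 (twice the filtered rows) in example 2, which matches the number of grouping sets in each statement. Beyond that, Hive's execution plan gives little insight into how advanced grouping aggregation is broken down.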

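When using advanced grouping aggregation, make sure map-side aggregation is enabled.

As the examples above show, a single job is enough to do what the union form would need several jobs for.

That reduces the disk and network I/O that multiple jobs would incur, so it is an optimisation.

At the same time, watch out for the rapid data expansion that overusing advanced grouping aggregation can cause.

With a plain group by, each input row contributes to exactly one aggregate, and each group produces one output record.
With advanced grouping aggregation such as cube, each input row contributes to several grouping sets within the same job, and the final output contains one record per group for every grouping set.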
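One way to limit this expansion is to list only the grouping sets a report actually needs instead of asking for the full cube. The sketch below assumes, purely for illustration, that only three of the eight cube combinations over the illustrative table tbl(a, b, c) are wanted.

```sql
-- A cube over a,b,c expands every input row into 8 grouping sets:
--   select a,b,c,count(1) from tbl group by a,b,c with cube
-- Listing only the required sets keeps the expansion to 3 per input row.
select a,b,c,count(1) from tbl group by a,b,c
grouping sets((a,b,c),(a,b),(c))
```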

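Caveats:

When the base table behind an advanced grouping aggregation is very large, map or reduce tasks can easily fail for lack of hardware resources.

Hive provides the hive.new.job.grouping.set.cardinality setting to deal with this.

If the number of grouping sets in the SQL statement exceeds the value of this setting (default 30), Hive launches an additional job.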
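A minimal sketch of how the setting might be tried on the cube query from example 3. The value 3 is chosen only so that the four grouping sets of a two-column cube exceed it; whether the plan then shows an additional stage should be confirmed with explain on your own Hive version.

```sql
set hive.map.aggr=true;
-- Default is 30; 3 is an illustrative value below the cube's 4 grouping sets.
set hive.new.job.grouping.set.cardinality=3;
explain
select gender,age,count(0) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender,age with cube;
```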

Next in this series: Hive window functions explained, with a performance analysis of SQL that uses them.
