Hive開啟向量化模式也是hiveSQL優化方法中的一種,可以提升hive查詢速率,也叫hive向量化。
問題1:那麼什麼是hive向量化模式呢?
問題2:hive向量化什麼情況下可以被使用,或者說它有哪些使用場景呢?
問題3:如何檢視hive向量化使用的相關資訊?
hive向量化模式是hive的一個特性,也叫hive向量化,在沒有引入向量化的執行模式之前,一般的查詢操作一次只處理一行資料,在向量化查詢執行時一次處理1024行的塊來簡化系統底層的操作,提高了資料處理的效能。
在底層,hive提供的向量模式,並不是重寫了Mapper函數,而是通過實現inputformat介面,建立了VectorizedParquetInputFormat類,來構建一個批次輸入的陣列。
向量化模式開啟的方式如下:
-- 開啟hive向量化模式
set hive.vectorized.execution.enabled = true;
Hive向量化模式並不是可以直接使用,它對使用的計算引擎,使用資料的資料型別,以及使用的SQL函數都有一定的要求。
不同的計算引擎支援程度不一樣:MR計算引擎僅支援Map階段的向量化,Tez和Spark計算引擎可以支援Map階段和Reduce階段的向量化。
hive檔案儲存型別必須為ORC或者Parquet等列儲存檔案型別。
以上資料型別為向量化模式支援的資料型別,如果使用其他資料型別,例如array和map等,開啟了向量化模式查詢,查詢操作將使用標準的方式單行執行,但不會報錯。
算數表示式: +, -, *, /, %
邏輯關係:AND, OR, NOT
比較關係(過濾器): <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters
使用 AND, OR, NOT, <, >, <=, >=, =, != 等布林值表示式(非過濾器)
空值校驗:IS [NOT] NULL
所有的數學函數,例如 SIN, LOG等
字串函數: SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH
型別轉換:cast
Hive UDF函數, 包括標準和通用的UDF函數
日期函數:YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP
IF條件表示式
以上函數表示式在執行時支援使用向量化模式。
檢視hive向量化資訊是前置的,可以通過執行計劃命令explain vectorization檢視向量化描述資訊。當然,執行中,也可以通過紀錄檔瞭解向量化執行資訊,但相對篩選關鍵資訊比較複雜。
explain vectorization是在hive2.3.0版本之後釋出的功能,可以檢視map階段和reduce階段為什麼沒有開啟向量化模式,類似偵錯功能。
explain vectorization支援的語法:explain vectorization [only] [summary|operator|expression|detail]
接下來我們通過範例來檢視以上命令的展示內容:
例1 關閉向量化模式的情況下,使用explain vectorization only。
-- 關閉向量化模式
set hive.vectorized.execution.enabled = false;
explain vectorization only
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
執行結果:
PLAN VECTORIZATION:
enabled: false #標識向量化模式沒有開啟
enabledConditionsNotMet: [hive.vectorized.execution.enabled IS false] #未開啟原因
如上,如果關閉向量化模式,輸出結果中PLAN VECTORIZATION 這裡可以看到該模式沒有被開啟,原因是沒有滿足enabledConditionsNotMet 指代的條件。
例2 開啟向量化模式的情況下,使用explain vectorization only。
-- 開啟向量化模式
set hive.vectorized.execution.enabled = true;
explain vectorization only
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
執行結果:
PLAN VECTORIZATION:
enabled: true
enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Execution mode: vectorized
Map Vectorization:
enabled: true
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
groupByVectorOutput: true
inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
allNative: false
usesVectorUDFAdaptor: false
vectorized: true
Reduce Vectorization:
enabled: false
enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
Stage: Stage-0
Fetch Operator
執行結果有三部分內容:
其中STAGE PLANS列印的並不是explain中map和reduce階段的執行資訊,而是這兩個階段使用向量化模式的資訊。
對以上案例內容進行關鍵詞解讀:
以上整個過程在map階段執行了向量化模式,在reduce階段沒有執行向量化模式,是因為上文提到的reduce階段mr計算引擎不支援,需要tez或spark計算引擎。
可以執行以下命令:
-- 開啟向量化模式
set hive.vectorized.execution.enabled = true;
explain vectorization only summary
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
會發現explain vectorization only
命令和explain vectorization only summary
命令執行結果完全一致。
後續其他命令也類似,explain vectorization
等同於explain vectorization summary
,summary引數是一個預設引數,可以忽略。
例3 使用explain vectorization命令檢視hive向量化模式執行資訊。
-- 開啟向量化模式
set hive.vectorized.execution.enabled = true;
explain vectorization
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
其執行結果是explain和explain vectorization only兩者相加執行結果:
PLAN VECTORIZATION:
enabled: true
enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: user_info_all
Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((age < 30) and (nick like '%小%')) (type: boolean)
Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
Select Operator ... #省略部分
# 向量化模式描述資訊
Execution mode: vectorized
Map Vectorization:
enabled: true
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
groupByVectorOutput: true
inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
allNative: false
usesVectorUDFAdaptor: false
vectorized: true
Reduce Vectorization:
enabled: false
enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
... #省略部分
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
... 為省略了一部分資訊。
使用explain vectorization operator可以檢視顯示執行計劃過程中運運算元的向量化資訊和explain執行階段資訊。
簡化版為explain vectorization only operator,加only相對前者少的部分為explain執行階段資訊,下同。explain執行階段資訊我們就不查詢了,感興趣小夥伴可以自行查詢檢視。
例4 簡化版為explain vectorization only operator檢視hiveSQL向量化描述資訊。
set hive.vectorized.execution.enabled = true;
explain vectorization only operator
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
執行結果:
PLAN VECTORIZATION:
enabled: true
enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
# 表掃描的向量化資訊
TableScan Vectorization:
# 讀表採用原生的向量化模式掃描
native: true
# 過濾操作的向量化資訊
Filter Vectorization:
# 過濾操作的類
className: VectorFilterOperator
# 過濾採用原生的向量化模式
native: true
# 列篩選的向量化資訊
Select Vectorization:
className: VectorSelectOperator
native: true
# 聚合操作的向量化資訊
Group By Vectorization:
className: VectorGroupByOperator
# 輸出採用向量化輸出
vectorOutput: true
#非本地操作
native: false
# reduce output向量化資訊
Reduce Sink Vectorization:
className: VectorReduceSinkOperator
native: false
# 已滿足的Reduce Sink向量化條件
nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
# 不滿足的Reduce Sink向量化條件
nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false
# 向量化描述資訊,同explain vectorization only,不作標註了。
Execution mode: vectorized
Map Vectorization:
enabled: true
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
groupByVectorOutput: true
inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
allNative: false
usesVectorUDFAdaptor: false
vectorized: true
Reduce Vectorization:
enabled: false
enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
Reduce Operator Tree:
Group By Vectorization:
vectorOutput: false
native: false
Stage: Stage-0
Fetch Operator
以上內容關鍵詞在程式碼塊有行註釋標註,可以看到explain vectorization only operator命令多了在explain執行計劃過程中增加了具體每一個運運算元(operator)步驟的是否向量化及具體資訊。如果不滿足向量化步驟,哪些條件滿足,哪些條件不滿足,也做了標註。
expression:補充顯示錶示式的向量化資訊,例如謂詞表示式。還包括 summary 和 operator 的所有資訊。
例5 簡化版explain vectorization only expression命令檢視hiveSQL執行計劃表示式的向量化資訊。
set hive.vectorized.execution.enabled = true;
explain vectorization only expression
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
執行結果:
# 同explain vectorization
PLAN VECTORIZATION:
enabled: true
enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
# 同explain vectorization
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan Vectorization:
native: true
# 表示表掃描後有25列。
projectedOutputColumns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
Filter Vectorization:
className: VectorFilterOperator
native: true
# 表示謂詞過濾少選有兩列,以及過濾條件的內容。
predicateExpression: FilterExprAndExpr(children: FilterLongColLessLongScalar(col 11, val 30), FilterStringColLikeStringScalar(col 7, pattern %小%))
Select Vectorization:
className: VectorSelectOperator
native: true
# 表示進行列篩選的具體列,這裡是第12列,陣列下標為11.如果為空[],則表示任何一個列。
projectedOutputColumns: [11]
Group By Vectorization:
# 表示使用VectorUDAFCount的方法進行count計數統計以及輸出型別。
aggregators: VectorUDAFCount(ConstantVectorExpression(val 0) -> 25:int) -> bigint
className: VectorGroupByOperator
vectorOutput: true
# 聚合列
keyExpressions: col 11
native: false
# 輸出為一個新的陣列,只有一列
projectedOutputColumns: [0]
Reduce Sink Vectorization:
className: VectorReduceSinkOperator
native: false
nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false
# 向量化描述資訊,同explain vectorization only,不作標註了。
Execution mode: vectorized
Map Vectorization:
enabled: true
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
groupByVectorOutput: true
inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
allNative: false
usesVectorUDFAdaptor: false
vectorized: true
Reduce Vectorization:
enabled: false
enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
Reduce Operator Tree:
Group By Vectorization:
vectorOutput: false
native: false
projectedOutputColumns: null
Stage: Stage-0
Fetch Operator
以上列印資訊內容可以看出 explain vectorization only expression命令相對列印的資訊是更細粒度到欄位級別的資訊了。基本上將操作的每一列是否使用向量化處理都列印了出來,這樣我們可以很好的判斷哪些欄位型別是不支援向量化模式的。
explain vectorization only detail 檢視最詳細級別的向量化資訊。它包括summary、operator、expression的所有資訊。
例6 explain vectorization only detail 檢視最詳細級別的向量化資訊。
set hive.vectorized.execution.enabled = true;
explain vectorization only detail
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
執行結果:
PLAN VECTORIZATION:
enabled: true
enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
# 同explain vectorization only expression
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan Vectorization:
native: true
projectedOutputColumns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
Filter Vectorization:
className: VectorFilterOperator
native: true
predicateExpression: FilterExprAndExpr(children: FilterLongColLessLongScalar(col 11, val 30), FilterStringColLikeStringScalar(col 7, pattern %小%))
Select Vectorization:
className: VectorSelectOperator
native: true
projectedOutputColumns: [11]
Group By Vectorization:
aggregators: VectorUDAFCount(ConstantVectorExpression(val 0) -> 25:int) -> bigint
className: VectorGroupByOperator
vectorOutput: true
keyExpressions: col 11
native: false
projectedOutputColumns: [0]
Reduce Sink Vectorization:
className: VectorReduceSinkOperator
native: false
nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false
# 向量化描述資訊這裡做了更詳細的描述
Execution mode: vectorized
Map Vectorization:
enabled: true
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
groupByVectorOutput: true
inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
allNative: false
usesVectorUDFAdaptor: false
vectorized: true
rowBatchContext:
dataColumnCount: 24
includeColumns: [7, 11]
dataColumns: uid:bigint, reg_time:string, cc:string, client:string, if_new:int, last_location:string, platform_reg:string, nick:string, gender:int, birthday:string, constellation:string, age:bigint, description:string, is_realname:int, realname_date:string, last_active_day:string, is_active:int, user_status:int, user_ua:string, vst_cnt:bigint, vst_dur:bigint, is_vip:int, chat_uv:bigint, chat_cnt:bigint
partitionColumnCount: 1
partitionColumns: ymd:string
scratchColumnTypeNames: bigint
Reduce Vectorization:
enabled: false
enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true
enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false
Reduce Operator Tree:
Group By Vectorization:
vectorOutput: false
native: false
projectedOutputColumns: null
Stage: Stage-0
Fetch Operator
通過以上內容可以看出 explain vectorization only detail列印的資訊其中執行計劃部分內容和explain vectorization only expression粒度一致,在向量化描述資訊部分做了更細粒度的描述,到欄位級別。
以上就是hive向量化explain vectorization相關引數的使用,其命令在我們使用向量化模式中進行驗證支援的函數和資料型別逐步遞進,可以根據需要使用。
而hive向量化模式可以極大程度的優化hive執行速度。
例7 執行優化速度比對。
-- 程式碼1 開啟向量化模式
set hive.vectorized.execution.enabled = true;
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
-- 程式碼2 關閉向量化模式
set hive.vectorized.execution.enabled = false;
select age,count(0) as num from temp.user_info_all where ymd = '20230505'
and age < 30 and nick like '%小%'
group by age;
執行結果:
# 程式碼1執行結果開啟向量化模式
MapReduce Total cumulative CPU time: 1 minutes 1 seconds 740 msec
Ended Job = job_1675664438694_13647623
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 Reduce: 5 Cumulative CPU: 61.74 sec HDFS Read: 367242142 HDFS Write: 1272 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 1 seconds 740 msec
OK
15 23
... # 省略資料
29 81849
Time taken: 41.322 seconds, Fetched: 31 row(s)
# 程式碼2執行結果關閉向量化模式
MapReduce Total cumulative CPU time: 1 minutes 39 seconds 190 msec
Ended Job = job_1675664438694_13647754
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 Reduce: 5 Cumulative CPU: 99.19 sec HDFS Read: 367226626 HDFS Write: 1272 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 39 seconds 190 msec
OK
15 23
... # 省略資料
29 81849
Time taken: 50.724 seconds, Fetched: 31 row(s)
以上結果可以看出,開啟向量化模式執行結果查詢耗時減少,雖然減少的不多,但在CPU使用上少了三分之一的資源。可見開啟向量化模式不僅可以提高查詢速度,還可以節省查詢資源。
以上開啟向量化模式為mr引擎測試結果,tez和spark還具有更優的執行表現。
下一期:Hive執行計劃之只有map階段SQL效能分析和解讀
按例,歡迎點選此處關注我的個人公眾號,交流更多知識。
後臺回覆關鍵字 hive,隨機贈送一本魯邊備註版珍藏巨量資料書籍。