2個場景範例講解GaussDB(DWS)基表統計資訊估算不準的處理方案

摘要：通過2個範例場景講解GaussDB(DWS)運維解決方案。

本文分享自華為雲社群《GaussDB(DWS)運維 -- 基表統計資訊估算不準的常見場景及處理方案》，作者：譡裡個檔。

場景1：基表過濾欄位存在的隱式型別時，基錶行數估算偏小

這種場景絕大部分場景DWS能夠處理，但是如果隱式型別轉後的結果與統計資訊中的欄位列舉值的表示式不一樣，就會導致估算的嚴重偏差

原始SQL如下

SELECT * FROM dmgrpdi.dwl_inv_res_rpt_ci_grp_f WHERE period_id=202212 AND source_flag=1;

對應的執行計劃

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  id |                            operation                             | E-rows | E-memory | E-width |  E-costs
 ----+------------------------------------------------------------------+--------+----------+---------+-----------
 1 | -> Row Adapter                                                  | 14160 | | 717 | 680025.43
 2 | ->  Vector Streaming (type: GATHER) | 14160 | | 717 | 680025.43
 3 | ->  Vector Partition Iterator                              | 14160 | 1MB      | 717 | 678241.33
 4 | ->  Partitioned CStore Scan on dwl_inv_res_rpt_ci_grp_f | 14160 | 1MB      | 717 | 678241.33
                  Predicate Information (identified by plan id)
 -------------------------------------------------------------------------------
 3 --Vector Partition Iterator
         Iterations: 1
 4 --Partitioned CStore Scan on dwl_inv_res_rpt_ci_grp_f
         Filter: ((period_id = 202212::numeric) AND ((source_flag)::bigint = 1))
         Pushdown Predicate Filter: (period_id = 202212::numeric)
         Partitions Selected by Static Prune: 36

發現source_flag欄位上存在隱式型別轉換，查詢欄位source_flag的統計資訊

postgres=# SELECT most_common_vals,most_common_freqs, histogram_bounds  FROM pg_stats WHERE tablename = 'dwl_inv_res_rpt_ci_grp_f' AND attname = 'source_flag';
 most_common_vals | most_common_freqs | histogram_bounds
------------------+-----------------------------------+------------------
 {01,02,04,03}    | {.440034,.241349,.217413,.101089} | {05,06}
(1 row)

發現隱式型別轉後的結果（1）與統計資訊中的欄位列舉值（'01'）的表示式不一樣

處理方案：修改過濾條件，禁止型別轉換，並且使用正確的常數值書寫過濾條件

如上SQL語句中的source_flag=1修改為source_flag='01'，修改後SQL語句如下

SELECT * FROM dmgrpdi.dwl_inv_res_rpt_ci_grp_f WHERE period_id=202212 AND source_flag='01';

查詢新語句的執行計劃

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
  id |                            operation                             |  E-rows | E-memory | E-width |  E-costs
 ----+------------------------------------------------------------------+-----------+----------+---------+-----------
 1 | -> Row Adapter                                                  | 108359075 | | 717 | 480542.98
 2 | ->  Vector Streaming (type: GATHER) | 108359075 | | 717 | 480542.98
 3 | ->  Vector Partition Iterator                              | 108359075 | 1MB      | 717 | 478758.88
 4 | ->  Partitioned CStore Scan on dwl_inv_res_rpt_ci_grp_f | 108359075 | 1MB      | 717 | 478758.88
                           Predicate Information (identified by plan id)
 -------------------------------------------------------------------------------------------------
 3 --Vector Partition Iterator
         Iterations: 1
 4 --Partitioned CStore Scan on dwl_inv_res_rpt_ci_grp_f
         Filter: ((period_id = 202212::numeric) AND (source_flag = '01'::text))
         Pushdown Predicate Filter: ((period_id = 202212::numeric) AND (source_flag = '01'::text))
         Partitions Selected by Static Prune: 36

場景2：基表在多列組合主鍵上過濾時，基錶行數估算偏大

這種場景是因為DWS對基表上多個過濾條件之間採取弱相關性處理，當多個過濾條件是主鍵時，可能導致結果集估算偏大。

原始SQL如下

SELECT * FROM mca.mca_period_rate_t mca_rate2
WHERE period_number = '202208' AND from_currency_code = 'RMB' AND to_currency_code = 'USD'

執行資訊如下

 id |                      operation                       |       A-time | A-rows | E-rows | Peak Memory | E-memory | A-width | E-width | E-costs  
----+------------------------------------------------------+--------------------+--------+--------+-------------+----------+---------+---------+----------
 1 | -> Row Adapter                                      | 444.735 | 1 | 2033 | 227KB       | | | 321 | 22601.41 
 2 | ->  Vector Streaming (type: GATHER) | 444.720 | 1 | 2033 | 873KB       | | | 321 | 22601.41 
 3 | -> CStore Scan on mca_period_rate_t mca_rate2 | [435.167, 435.167] | 1 | 2033 | [5MB, 5MB] | 1MB      | | 321 | 22427.41 
                                                              Predicate Information (identified by plan id) 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 3 --CStore Scan on mca_period_rate_t mca_rate2
        Filter: (((period_number)::text = '202208'::text) AND ((from_currency_code)::text = 'RMB'::text) AND ((to_currency_code)::text = 'USD'::text))
 Rows Removed by Filter: 425812
        Pushdown Predicate Filter: (((period_number)::text = '202208'::text) AND ((from_currency_code)::text = 'RMB'::text) AND ((to_currency_code)::text = 'USD'::text))

可以發現基表mca.mca_period_rate_t的行數估算嚴重偏大。

使用如下SQL語句檢視表mca.mca_period_rate_t的定義

SELECT pg_get_tabledef('mca.mca_period_rate_t'::regclass);

查詢表mca.mca_period_rate_t定義

SELECT pg_get_tabledef('mca.mca_period_rate_t');
SET search_path = mca;
CREATE TABLE mca_period_rate_t (
seq numeric NOT NULL,
period_number character varying(10) NOT NULL,
from_currency_code character varying(20) NOT NULL,
to_currency_code character varying(20) NOT NULL,
begin_rate numeric(35,18),
end_rate numeric(35,18),
avg_rate numeric(35,18),
creation_date timestamp(0) without time zone NOT NULL,
created_by numeric NOT NULL,
last_update_date timestamp(0) without time zone,
last_updated_by numeric,
rmb_begin_rate numeric(35,18),
usd_begin_rate numeric(35,18),
rmb_end_rate numeric(35,18),
usd_end_rate numeric(35,18),
rmb_avg_rate numeric(35,18),
usd_avg_rate numeric(35,18),
crt_cycle_id numeric,
crt_job_instance_id numeric,
last_upd_cycle_id numeric,
upd_job_instance_id numeric,
cdc_key_id character varying(128) DEFAULT sys_guid(),
end_rate2 numeric(35,18),
avg_rate2 numeric(35,18),
last_period_end_rate numeric(35,18)
)
WITH (orientation=column, compression=low, colversion=2.0, enable_delta=false)
DISTRIBUTE BY REPLICATION
TO GROUP group_version1;
CREATE UNIQUE INDEX mca_period_rate_u1 ON mca.mca_period_rate_t USING cbtree (period_number, from_currency_code, to_currency_code) TABLESPACE pg_default;

發現 (period_number, from_currency_code, to_currency_code) 為組合的唯一索引。

處理方案：對組合索引列收多列統計資訊

注意此種方案只適用在基表比較小的情況下。因為多列統計資訊需要使用百分比取樣的方式計算統計資訊，當表比較大時，統計資訊計算耗時回很長。

針對如上查詢語句執行如下語句收集(period_number, from_currency_code, to_currency_code) 多列統計資訊

ANALYZE mca.mca_period_rate_t((period_number, from_currency_code, to_currency_code));

收集多列統計資訊之後，基表的行數估算恢復正產

 id |                                      operation                                      |       A-time | A-rows | E-rows | Peak Memory | A-width | E-width | E-costs 
----+-------------------------------------------------------------------------------------+--------------------+--------+--------+-------------+---------+---------+---------
 1 | -> Row Adapter                                                                     | 195.504 | 1 | 1 | 227KB       | | 321 | 675.14 
 2 | ->  Vector Streaming (type: GATHER) | 195.491 | 1 | 1 | 873KB       | | 321 | 675.14 
 3 | -> CStore Index Scan using mca_period_rate_u1 on mca_period_rate_t mca_rate2 | [164.344, 164.344] | 1 | 1 | [5MB, 5MB] | | 321 | 501.14 
                                                      Predicate Information (identified by plan id) 
----------------------------------------------------------------------------------------------------------------------------------------------------------
 3 --CStore Index Scan using mca_period_rate_u1 on mca_period_rate_t mca_rate2
 Index Cond: (((period_number)::text = '202208'::text) AND ((from_currency_code)::text = 'RMB'::text) AND ((to_currency_code)::text = 'USD'::text))

點選關注，第一時間瞭解華為雲新鮮技術~