Analyzing MySQL Group Replication's Failure Detection Process Through a Case Study

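Failure detection is one of the core modules of Group Replication. It identifies failed nodes in the cluster promptly and removes them from the cluster. If a failed node is not removed in time, it degrades cluster performance and also blocks changes to the cluster topology.

Below, we walk through the Group Replication failure detection process using a concrete case.

In addition, this article looks at the following questions.

When a network partition occurs, what happens to the nodes in the minority?
What is the XCom Cache? How do you estimate its size?
Why shouldn't group_replication_member_expel_timeout be set too high in production?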

Case

Below is the topology of the test cluster, running in multi-primary mode.

Hostname	IP	Role
node1	192.168.244.10	PRIMARY
node2	192.168.244.20	PRIMARY
node3	192.168.244.30	PRIMARY

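The test consists of two steps:

Simulate a network partition and observe its effect on each node in the cluster.
Restore the network connection and observe how each node reacts.

Simulating a network partition

First, simulate a network partition by running the following on node3.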

# iptables -A INPUT  -p tcp -s 192.168.244.10 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.10 -j DROP

# iptables -A INPUT  -p tcp -s 192.168.244.20 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.20 -j DROP

# date "+%Y-%m-%d %H:%M:%S"
2022-07-31 13:03:01

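Here, the iptables commands cut the network connection between node3 and node1/node2, and date records when the commands were run.

About 5s after the commands are run (this interval is fixed and is specified by DETECTOR_LIVE_TIMEOUT in the source code), each node starts to react, as can be seen in each node's log.

First, look at node1's log and cluster status.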

2022-07-31T13:03:07.582519-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.30:3306 has become unreachable.'

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | ONLINE       | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | ONLINE       | PRIMARY     |
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | UNREACHABLE  | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

From the point of view of node1 and node2, node3 is now in the UNREACHABLE state.

Next, look at node3.

2022-07-31T13:03:07.690416-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.10:3306 has become unreachable.'
2022-07-31T13:03:07.690492-00:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.20:3306 has become unreachable.'
2022-07-31T13:03:07.690504-00:00 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | UNREACHABLE  | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | UNREACHABLE  | PRIMARY     |
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | ONLINE       | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

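From node3's point of view, node1 and node2 are now in the UNREACHABLE state.

Of the three nodes, only one is ONLINE, which does not satisfy Group Replication's majority rule. At this point node3 can only serve queries; writes are blocked.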

mysql> select * from slowtech.t1 where id=1;
+----+------+
| id | c1   |
+----+------+
|  1 | a    |
+----+------+
1 row in set (0.00 sec)

mysql> delete from slowtech.t1 where id=1;
(blocked...)
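
A rough way to check from SQL whether the local node still sees a majority is to count the members it does not consider UNREACHABLE. The query below is a sketch under that assumption (any member not marked UNREACHABLE is treated as reachable). It uses only the replication_group_members table already queried above, and the column aliases are illustrative.

```sql
-- Sketch: does this node still see a majority of the group?
-- Assumption: every member not marked UNREACHABLE counts as reachable.
SELECT
    COUNT(*)                                          AS total_members,
    SUM(member_state <> 'UNREACHABLE')                AS reachable_members,
    SUM(member_state <> 'UNREACHABLE') > COUNT(*) / 2 AS sees_majority
FROM performance_schema.replication_group_members;
```

On node3 in the state shown above, this counts 3 members with 1 reachable, so sees_majority evaluates to 0.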

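After another 16s (this 16s is actually related to the group_replication_member_expel_timeout parameter), node1 and node2 expel node3 from the cluster. The cluster now consists of only two nodes.

Look at node1's log and cluster status.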

2022-07-31T13:03:23.576960-00:00 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.244.30:3306'
2022-07-31T13:03:23.577091-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306 on view 16592724636525403:3.'

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | ONLINE       | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | ONLINE       | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
2 rows in set (0.00 sec)

Now look at node3: the log has no new output, and the node states have not changed.

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 207db264-0192-11ed-92c9-02001700754e | 192.168.244.10 |        3306 | UNREACHABLE  | PRIMARY     |
| 2cee229d-0192-11ed-8eff-02001700f110 | 192.168.244.20 |        3306 | UNREACHABLE  | PRIMARY     |
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | ONLINE       | PRIMARY     |
+--------------------------------------+----------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

Restoring the network connection

Next, we restore the network connection between node3 and node1/node2.

# iptables -F

# date "+%Y-%m-%d %H:%M:%S"
2022-07-31 13:07:30

First, look at node3's log.

2022-07-31T13:07:30.464179-00:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.10:3306 is reachable again.'
2022-07-31T13:07:30.464226-00:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 192.168.244.20:3306 is reachable again.'
2022-07-31T13:07:30.464239-00:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2022-07-31T13:07:37.458761-00:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2022-07-31T13:07:37.459011-00:00 0 [Warning] [MY-011630] [Repl] Plugin group_replication reported: 'Due to a plugin error, some transactions were unable to be certified and will now rollback.'
2022-07-31T13:07:37.459037-00:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2022-07-31T13:07:37.459431-00:00 31 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 31'
2022-07-31T13:07:37.459478-00:00 31 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2022-07-31T13:07:37.459811-00:00 33 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'

2022-07-31T13:07:37.465738-00:00 34 [System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
2022-07-31T13:07:37.496466-00:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2022-07-31T13:07:37.498813-00:00 36 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 351, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:39.653028-00:00 34 [System] [MY-013375] [Repl] Plugin group_replication reported: 'Auto-rejoin procedure attempt 1 of 3 finished. Member was able to join the group.'
2022-07-31T13:07:40.653484-00:00 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2022-07-31T13:07:40.653822-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306, 192.168.244.30:3306 on view 16592724636525403:4.'
2022-07-31T13:07:40.670530-00:00 46 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.168.244.20', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:40.682990-00:00 47 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2022-07-31T13:07:40.687566-00:00 47 [System] [MY-010562] [Repl] Slave I/O thread for channel 'group_replication_recovery': connected to master '[email protected]:3306',replication started in log 'FIRST' at position 4
2022-07-31T13:07:40.717851-00:00 46 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.168.244.20', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-07-31T13:07:40.732297-00:00 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
2022-07-31T13:07:40.732511-00:00 53 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'

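The log output falls into two parts, separated by the blank line.

1. Once the network connection is restored, node3 re-establishes its connections with node1 and node2, discovers that it has been expelled from the cluster, and enters the ERROR state.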

mysql> select member_id,member_host,member_port,member_state,member_role from performance_schema.replication_group_members;
+--------------------------------------+----------------+-------------+--------------+-------------+
| member_id                            | member_host    | member_port | member_state | member_role |
+--------------------------------------+----------------+-------------+--------------+-------------+
| 4cbfdc79-0192-11ed-8b01-02001701bd0a | 192.168.244.30 |        3306 | ERROR        |             |
+--------------------------------------+----------------+-------------+--------------+-------------+
1 row in set (0.00 sec)

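A node that enters the ERROR state is automatically set to read-only, which is the super_read_only=ON seen in the log. Note that setting an ERROR node to read-only is the default behavior and is unrelated to the group_replication_exit_state_action parameter mentioned later.

2. If group_replication_autorejoin_tries is not 0, a node in the ERROR state automatically retries joining the cluster (auto-rejoin). The number of retries is determined by group_replication_autorejoin_tries, which defaults to 3 as of MySQL 8.0.21. The interval between retries is 5 minutes. Once a retry succeeds, the node enters the distributed recovery phase.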
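
Besides the error log, the auto-rejoin procedure can also be followed from SQL on the affected node. The statements below are a sketch, assuming the Group Replication stage instruments are enabled in Performance Schema and that, as the MySQL documentation on auto-rejoin describes, the stage row only exists while the procedure is running. The column alias is illustrative.

```sql
-- Configured number of auto-rejoin attempts (0 disables auto-rejoin).
SELECT @@GLOBAL.group_replication_autorejoin_tries AS autorejoin_tries;

-- Assumption: stage instrumentation for Group Replication is enabled.
-- A row is returned only while an auto-rejoin procedure is in progress.
SELECT event_name,
       work_completed AS attempts_so_far
FROM performance_schema.events_stages_current
WHERE event_name LIKE '%auto-rejoin%';
```

An empty result from the second query simply means no auto-rejoin is currently under way.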

Next, look at node1's log.

2022-07-31T13:07:39.555613-00:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.244.10:3306, 192.168.244.20:3306, 192.168.244.30:3306 on view 16592724636525403:4.'
2022-07-31T13:07:40.732568-00:00 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 192.168.244.30:3306 was declared online within the replication group.'

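node3 has rejoined the cluster.

Failure detection process

With the case above in mind, let's look at the failure detection process of Group Replication.

Every node in the cluster periodically (once per second) sends heartbeat messages to the other nodes. If a node receives no heartbeat from another node within 5s (a fixed value, not adjustable by any parameter), it marks that node as suspect and sets its state to UNREACHABLE. If half or more of the nodes in the cluster show as UNREACHABLE, the cluster cannot serve writes.
If the suspect node recovers within group_replication_member_expel_timeout (as of MySQL 8.0.21 this parameter defaults to 5, in seconds; the maximum allowed value is 3600, i.e. 1 hour), it directly applies the messages in the XCom Cache. The size of the XCom Cache is determined by group_replication_message_cache_size, which defaults to 1 GB.
If the suspect node does not recover within group_replication_member_expel_timeout, it is expelled from the cluster.
A node in the minority, on the other hand, does not leave the cluster automatically. It keeps its current state until:

The network recovers.
The group_replication_unreachable_majority_timeout limit is reached. Note that this timer starts 5s after the connection is lost, not when the suspect node is expelled from the cluster. The parameter defaults to 0, which means the node waits indefinitely.

In either case, the following is triggered:

The node's state switches from ONLINE to ERROR.

The currently blocked writes are rolled back.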

mysql> delete from slowtech.t1 where id=1;
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

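A node in the ERROR state is automatically set to read-only.
If group_replication_autorejoin_tries is not 0, a node in the ERROR state automatically retries joining the cluster (auto-rejoin).
If group_replication_autorejoin_tries is 0, or the retries fail, the action specified by group_replication_exit_state_action is performed. The available actions are:

READ_ONLY: read-only mode. In this mode, super_read_only is set to ON. This is the default.
OFFLINE_MODE: offline mode. In this mode, both offline_mode and super_read_only are set to ON. Only users with the CONNECTION_ADMIN (or SUPER) privilege can then log in; ordinary users cannot.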
```
# mysql -h 192.168.244.30 -P 3306 -ut1 -p123456
ERROR 3032 (HY000): The server is currently in offline mode
```
ABORT_SERVER: shut down the instance.
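
The parameters involved in this flow can be inspected and adjusted from SQL. The statements below are a minimal sketch: the system variable names are the ones discussed above, while the column aliases and the values 30 and OFFLINE_MODE are illustrative assumptions, not recommendations.

```sql
-- Inspect the settings that drive the failure detection flow.
SELECT
    @@GLOBAL.group_replication_member_expel_timeout         AS expel_timeout_s,
    @@GLOBAL.group_replication_unreachable_majority_timeout AS unreachable_majority_timeout_s,
    @@GLOBAL.group_replication_autorejoin_tries             AS autorejoin_tries,
    @@GLOBAL.group_replication_exit_state_action            AS exit_state_action;

-- Illustrative values only; choose them for your own environment.
-- Let a minority node give up after 30s instead of waiting forever.
SET GLOBAL group_replication_unreachable_majority_timeout = 30;
-- Reject ordinary client connections once the node finally gives up.
SET GLOBAL group_replication_exit_state_action = 'OFFLINE_MODE';
```

SET GLOBAL changes do not survive a restart; SET PERSIST can be used instead if the setting should be kept.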

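XCom Cache

The XCom Cache is the message cache used by XCom to cache the messages exchanged between cluster nodes. The cached messages are part of the consensus protocol. If the network is unstable, nodes may lose contact with the group.

If a node recovers within a certain time (determined by group_replication_member_expel_timeout), it first applies the messages in the XCom Cache. If the XCom Cache does not hold all the messages it needs, the node is expelled from the cluster. After being expelled, if group_replication_autorejoin_tries is not 0, it rejoins the cluster (auto-rejoin).

Rejoining the cluster uses Distributed Recovery to fill in the missing data. Compared with applying messages straight from the XCom Cache, joining through Distributed Recovery takes longer, is more complex, and also affects cluster performance.

So when sizing the XCom Cache, we need to estimate the memory used over a period of group_replication_member_expel_timeout + 5s. The system table used for this estimate is introduced later.

Next, let's simulate a scenario where the XCom Cache is too small.

1. Set group_replication_message_cache_size to its minimum value (128 MB) and restart Group Replication for the change to take effect.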

mysql> set global group_replication_message_cache_size=134217728;
Query OK, 0 rows affected (0.00 sec)

mysql> stop group_replication;
Query OK, 0 rows affected (4.15 sec)

mysql> start group_replication;
Query OK, 0 rows affected (3.71 sec)

2. Set group_replication_member_expel_timeout to 3600, so that we have enough time to run the test.

mysql> set global group_replication_member_expel_timeout=3600;
Query OK, 0 rows affected (0.01 sec)

3. Cut the network connection between node3 and node1/node2.

# iptables -A INPUT  -p tcp -s 192.168.244.10 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.10 -j DROP

# iptables -A INPUT  -p tcp -s 192.168.244.20 -j DROP
# iptables -A OUTPUT -p tcp -d 192.168.244.20 -j DROP

4. Run a large transaction repeatedly.

mysql> insert into slowtech.t1(c1) select c1 from slowtech.t1 limit 1000000;
Query OK, 1000000 rows affected (10.03 sec)
Records: 1000000  Duplicates: 0  Warnings: 0

5. Watch the error log.

If the following message appears in the error log of node1 or node2, the messages node3 needs have already been evicted from the XCom Cache.

[Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Messages that are needed to recover node 192.168.244.30:33061 have been evicted from the message  cache. Consider resizing the maximum size of the cache by  setting group_replication_message_cache_size.'

6. Check the system table.

Besides the error log, we can also use a system table to check XCom Cache usage.

mysql> select * from performance_schema.memory_summary_global_by_event_name where event_name like "%GCS_XCom::xcom_cache%"\G
*************************** 1. row ***************************
                  EVENT_NAME: memory/group_rpl/GCS_XCom::xcom_cache
                 COUNT_ALLOC: 23678
                  COUNT_FREE: 22754
   SUM_NUMBER_OF_BYTES_ALLOC: 154713397
    SUM_NUMBER_OF_BYTES_FREE: 28441492
              LOW_COUNT_USED: 0
          CURRENT_COUNT_USED: 924
             HIGH_COUNT_USED: 20992
    LOW_NUMBER_OF_BYTES_USED: 0
CURRENT_NUMBER_OF_BYTES_USED: 126271905
   HIGH_NUMBER_OF_BYTES_USED: 146137294
1 row in set (0.00 sec)

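Where:

COUNT_ALLOC: the number of messages that have been cached.
COUNT_FREE: the number of messages removed from the cache.
CURRENT_COUNT_USED: the number of messages currently cached, equal to COUNT_ALLOC - COUNT_FREE.
SUM_NUMBER_OF_BYTES_ALLOC: the amount of memory allocated.
SUM_NUMBER_OF_BYTES_FREE: the amount of memory freed.
CURRENT_NUMBER_OF_BYTES_USED: the amount of memory currently in use, equal to SUM_NUMBER_OF_BYTES_ALLOC - SUM_NUMBER_OF_BYTES_FREE.
LOW_COUNT_USED, HIGH_COUNT_USED: the historical minimum and maximum of CURRENT_COUNT_USED.
LOW_NUMBER_OF_BYTES_USED, HIGH_NUMBER_OF_BYTES_USED: the historical minimum and maximum of CURRENT_NUMBER_OF_BYTES_USED.

If COUNT_FREE changes while large transactions are being run repeatedly after the connection is cut, that likewise means the messages node3 needs have been evicted from the XCom Cache.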
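
For sizing purposes, it can help to put the cache's current and peak usage next to the configured limit. The query below is a sketch that reads the same memory_summary_global_by_event_name row and compares it with group_replication_message_cache_size; the column aliases are illustrative. Treat the result as a starting point for an estimate rather than a precise measurement, since the peak can never exceed by much a limit that is already too small.

```sql
-- Sketch: compare XCom Cache usage with the configured limit (values in MB).
SELECT
    ROUND(current_number_of_bytes_used / 1024 / 1024) AS current_mb,
    ROUND(high_number_of_bytes_used / 1024 / 1024)    AS high_water_mb,
    ROUND(@@GLOBAL.group_replication_message_cache_size / 1024 / 1024)
                                                      AS cache_limit_mb
FROM performance_schema.memory_summary_global_by_event_name
WHERE event_name = 'memory/group_rpl/GCS_XCom::xcom_cache';
```

If high_water_mb sits close to cache_limit_mb during normal load, the cache is likely too small to cover a window of group_replication_member_expel_timeout + 5s.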

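7. Restore the network connection between node3 and node1/node2.

If the network recovers within group_replication_member_expel_timeout but the messages node3 needs are no longer in the XCom Cache, node3 is still expelled from the cluster. Below is node3's error log in this scenario.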

[ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Node 0 is unable to get message {4aec99ca 7562 0}, since the group is too far ahead. Node will now exit.'
[ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
[ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
[System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
[System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'

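Caveats

If the cluster contains UNREACHABLE nodes, the following limitations and drawbacks apply:

The cluster topology cannot be changed, including adding and removing nodes.
In single-primary mode, if the Primary node fails, a new primary cannot be elected.
If the Group Replication consistency level is AFTER or BEFORE_AND_AFTER, writes wait until the UNREACHABLE node is ONLINE again and has applied the operation.
Cluster throughput drops. In single-primary mode, setting group_replication_paxos_single_leader (introduced in MySQL 8.0.27) to ON addresses this.

For these reasons, group_replication_member_expel_timeout should not be set too high in production.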
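
To see whether a cluster is currently exposed to these limitations, one option is to list the members a node considers UNREACHABLE next to the expel timeout in effect. This is a small sketch using only the table and variable already discussed; the column alias is illustrative.

```sql
-- Members this node currently considers UNREACHABLE.
SELECT member_host, member_port, member_state
FROM performance_schema.replication_group_members
WHERE member_state = 'UNREACHABLE';

-- How long such members are tolerated before being expelled (seconds).
SELECT @@GLOBAL.group_replication_member_expel_timeout AS expel_timeout_s;
```

Rows that stay in the first result for a long time are a sign that the expel timeout is larger than the workload can comfortably tolerate.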
