Why does Redis automatically stagger master and slave nodes when you create a cluster?

2023-09-01 21:00:30

Hello everyone, I'm 鹹魚 (Salted Fish).

In the article 《一臺伺服器上部署 Redis 偽叢集》 (deploying a Redis pseudo-cluster on a single server), I created a Redis cluster without explicitly specifying which Redis instances would serve as masters and which as slaves:

/usr/local/redis-4.0.9/src/redis-trib.rb create --replicas 1 192.168.149.131:6379 192.168.149.131:26379 192.168.149.131:6380 192.168.149.131:26380 192.168.149.131:6381 192.168.149.131:26381

Yet Redis completed the master/slave assignment all by itself.
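
Incidentally, on newer Redis versions the same cluster would be created with redis-cli rather than redis-trib.rb; the following invocation should be equivalent to the command above:

redis-cli --cluster create 192.168.149.131:6379 192.168.149.131:26379 192.168.149.131:6380 192.168.149.131:26380 192.168.149.131:6381 192.168.149.131:26381 --cluster-replicas 1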

If you have ever deployed a Redis cluster across multiple servers (say, a three-master, three-slave cluster on three machines), you will notice that Redis automatically staggers the placement of masters and slaves.

For example: master 1 and slave 3 end up on the same machine, master 2 and slave 1 on the same machine, and master 3 and slave 2 on the same machine.

Why is that?

We know that the old Redis cluster management command was redis-trib.rb, and that newer versions replaced it with redis-cli.

Of these two executables, redis-trib.rb is a Ruby script while redis-cli is written in C; if you have looked through the cluster-manager source, you will find that the reason lies in the following code (the C version from redis-cli is shown):

/* Return the anti-affinity score, which is a measure of the amount of
 * violations of anti-affinity in the current cluster layout, that is, how
 * badly the masters and slaves are distributed in the different IP
 * addresses so that slaves of the same master are not in the master
 * host and are also in different hosts.
 *
 * The score is calculated as follows:
 *
 * SAME_AS_MASTER = 10000 * each slave in the same IP of its master.
 * SAME_AS_SLAVE  = 1 * each slave having the same IP as another slave
                      of the same master.
 * FINAL_SCORE = SAME_AS_MASTER + SAME_AS_SLAVE
 *
 * So a greater score means a worse anti-affinity level, while zero
 * means perfect anti-affinity.
 *
 * The anti affinity optimization will try to get a score as low as
 * possible. Since we do not want to sacrifice the fact that slaves should
 * not be in the same host as the master, we assign 10000 times the score
 * to this violation, so that we'll optimize for the second factor only
 * if it does not impact the first one.
 *
 * The ipnodes argument is an array of clusterManagerNodeArray, one for
 * each IP, while ip_count is the total number of IPs in the configuration.
 *
 * The function returns the above score, and the list of
 * offending slaves can be stored into the 'offending' argument,
 * so that the optimizer can try changing the configuration of the
 * slaves violating the anti-affinity goals. */
static int clusterManagerGetAntiAffinityScore(clusterManagerNodeArray *ipnodes,
    int ip_count, clusterManagerNode ***offending, int *offending_len)
{
	...
    return score;
}

static void clusterManagerOptimizeAntiAffinity(clusterManagerNodeArray *ipnodes,
    int ip_count)
{
	...
}

From the comment we can tell that the clusterManagerGetAntiAffinityScore function computes the anti-affinity score, a measure of how well the current Redis cluster layout satisfies the anti-affinity requirements.

Anti-affinity here means that a slave should not sit on the same host (the same IP address) as its master, and that slaves of the same master should also be spread across different hosts.

So how is the anti-affinity score calculated?

  • For each slave that sits on the same IP address as its master, the score increases by 10000
  • For each slave that shares an IP address with another slave of the same master (the master itself being elsewhere), the score increases by 1
  • The final score is the sum of those two parts

In other words, a higher score means worse anti-affinity, a lower score means better anti-affinity, and a score of zero means the layout satisfies anti-affinity perfectly.
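
A hypothetical worked example (the hosts and node names here are invented for illustration):

host 10.0.0.1: master M1 plus two slaves of M1             -> 2 slaves on their master's IP: 10000 * 2 = 20000
host 10.0.0.2: two slaves of M2 (M2 lives on another host) -> 2 slaves sharing an IP:        1 * 2     = 2
FINAL_SCORE = 20000 + 2 = 20002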

Once the score is computed, layouts with a high score (poor anti-affinity) are optimized.

To improve the anti-affinity between masters and slaves, the clusterManagerOptimizeAntiAffinity function takes a layout with poor anti-affinity and tries to improve the distribution of masters and slaves in the cluster by swapping the master assignments of slaves, reducing the violations.

Let's now look at each of these two functions.

Anti-affinity score calculation

static int clusterManagerGetAntiAffinityScore(clusterManagerNodeArray *ipnodes,
    int ip_count, clusterManagerNode ***offending, int *offending_len)
{
	...
}

As you can see, the function takes four arguments:

  • ipnodes: an array of clusterManagerNodeArray structs, one per IP address, each holding the nodes located at that IP
  • ip_count: the total number of IP addresses
  • offending: a pointer used to return the array of nodes that violate the anti-affinity rules (optional)
  • offending_len: a pointer used to return the number of nodes stored in the offending array (optional)

The outer for loop iterates over the IP addresses, and the inner loop iterates over the node array of each IP:

    ...
    for (i = 0; i < ip_count; i++) {
        clusterManagerNodeArray *node_array = &(ipnodes[i]);
        dict *related = dictCreate(&clusterManagerDictType);
        char *ip = NULL;
        for (j = 0; j < node_array->len; j++) {
			...
        }
    ...

Let's take a closer look at the inner for loop:

    for (i = 0; i < ip_count; i++) {
        /* Get the node array for this IP address */
        clusterManagerNodeArray *node_array = &(ipnodes[i]);
        /* Create the dict related */
        dict *related = dictCreate(&clusterManagerDictType);
        char *ip = NULL;
        for (j = 0; j < node_array->len; j++) {
            /* Get the current node */
            clusterManagerNode *node = node_array->nodes[j];
			...
            /* Check whether the key already exists in the related dict */
            dictEntry *entry = dictFind(related, key);
            if (entry) types = sdsdup((sds) dictGetVal(entry));
            else types = sdsempty();
            if (node->replicate) types = sdscat(types, "s");
            else {
                sds s = sdscatsds(sdsnew("m"), types);
                sdsfree(types);
                types = s;
            }
            dictReplace(related, key, types);
        }

The function walks the node array of each IP address and, for each IP, uses a dict called related to record which replication-related masters and slaves share that IP.

The key of related is the master's node ID (a slave is filed under its master's ID), and the value is a string, types, describing the types of the nodes in that group.

For each node, a type marker is appended to its group's types string: m for a master, s for a slave, with the m always placed at the front.

By keying nodes of the same replication group to the same entry, the dict related ends up recording exactly which related masters and slaves live on this IP.
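
For instance (with hypothetical node names), if a host carries master A, one slave of A, and two slaves of master B, where B itself lives elsewhere, the dict for that IP ends up as:

related["A"] = "ms"   /* master A and one of its own slaves share this host */
related["B"] = "ss"   /* two slaves of B share this host, B is elsewhere */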

	...
		/* Create a dict iterator to walk the per-IP grouping info */
		dictIterator *iter = dictGetIterator(related);
        dictEntry *entry;
        while ((entry = dictNext(iter)) != NULL) {
            /* key is the master name, value is types, i.e. the node types */
            sds types = (sds) dictGetVal(entry);
            sds name = (sds) dictGetKey(entry);
            int typeslen = sdslen(types);
            if (typeslen < 2) continue;
            /* Compute the anti-affinity score */
            if (types[0] == 'm') score += (10000 * (typeslen - 1));
            else score += (1 * typeslen);
            ...
        }

As the snippet above shows, the while loop iterates over the groups recorded in related and scores each group of related master/slave nodes:

  • Get the group's types string and its length typeslen (groups with fewer than two nodes are skipped)
  • If the group contains the master (types starts with m), score += (10000 * (typeslen - 1)); otherwise score += (1 * typeslen), as the example below shows
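
Plugging in the two groups from the hypothetical example above:

types "ms" -> starts with 'm', typeslen = 2 -> score += 10000 * (2 - 1) = 10000
types "ss" -> slaves only,     typeslen = 2 -> score += 1 * 2           = 2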

If the offending argument was provided, the nodes that violate the anti-affinity rules are collected into the offending array, and the count of violating nodes is updated along the way, as the following code shows:

            if (offending == NULL) continue;
            /* Populate the list of offending nodes. */
            listIter li;
            listNode *ln;
            listRewind(cluster_manager.nodes, &li);
            while ((ln = listNext(&li)) != NULL) {
                clusterManagerNode *n = ln->value;
                if (n->replicate == NULL) continue;
                if (!strcmp(n->replicate, name) && !strcmp(n->ip, ip)) {
                    *(offending_p++) = n;
                    if (offending_len != NULL) (*offending_len)++;
                    break;
                }
            }

Finally, the function returns score. The complete function looks like this:

static int clusterManagerGetAntiAffinityScore(clusterManagerNodeArray *ipnodes,
    int ip_count, clusterManagerNode ***offending, int *offending_len)
{
    int score = 0, i, j;
    int node_len = cluster_manager.nodes->len;
    clusterManagerNode **offending_p = NULL;
    if (offending != NULL) {
        *offending = zcalloc(node_len * sizeof(clusterManagerNode*));
        offending_p = *offending;
    }
    /* For each set of nodes in the same host, split by
     * related nodes (masters and slaves which are involved in
     * replication of each other) */
    for (i = 0; i < ip_count; i++) {
        clusterManagerNodeArray *node_array = &(ipnodes[i]);
        dict *related = dictCreate(&clusterManagerDictType);
        char *ip = NULL;
        for (j = 0; j < node_array->len; j++) {
            clusterManagerNode *node = node_array->nodes[j];
            if (node == NULL) continue;
            if (!ip) ip = node->ip;
            sds types;
            /* We always use the Master ID as key. */
            sds key = (!node->replicate ? node->name : node->replicate);
            assert(key != NULL);
            dictEntry *entry = dictFind(related, key);
            if (entry) types = sdsdup((sds) dictGetVal(entry));
            else types = sdsempty();
            /* Master type 'm' is always set as the first character of the
             * types string. */
            if (node->replicate) types = sdscat(types, "s");
            else {
                sds s = sdscatsds(sdsnew("m"), types);
                sdsfree(types);
                types = s;
            }
            dictReplace(related, key, types);
        }
        /* Now it's trivial to check, for each related group having the
         * same host, what is their local score. */
        dictIterator *iter = dictGetIterator(related);
        dictEntry *entry;
        while ((entry = dictNext(iter)) != NULL) {
            sds types = (sds) dictGetVal(entry);
            sds name = (sds) dictGetKey(entry);
            int typeslen = sdslen(types);
            if (typeslen < 2) continue;
            if (types[0] == 'm') score += (10000 * (typeslen - 1));
            else score += (1 * typeslen);
            if (offending == NULL) continue;
            /* Populate the list of offending nodes. */
            listIter li;
            listNode *ln;
            listRewind(cluster_manager.nodes, &li);
            while ((ln = listNext(&li)) != NULL) {
                clusterManagerNode *n = ln->value;
                if (n->replicate == NULL) continue;
                if (!strcmp(n->replicate, name) && !strcmp(n->ip, ip)) {
                    *(offending_p++) = n;
                    if (offending_len != NULL) (*offending_len)++;
                    break;
                }
            }
        }
        //if (offending_len != NULL) *offending_len = offending_p - *offending;
        dictReleaseIterator(iter);
        dictRelease(related);
    }
    return score;
}

Anti-affinity optimization

Once the anti-affinity score has been computed, Redis needs to optimize layouts whose score is high (that is, whose anti-affinity is poor), improving the distribution of nodes in the cluster so that related nodes do not pile up on the same host.

static void clusterManagerOptimizeAntiAffinity(clusterManagerNodeArray *ipnodes,
    int ip_count)
{
    clusterManagerNode **offenders = NULL;
    int score = clusterManagerGetAntiAffinityScore(ipnodes, ip_count,
                                                   NULL, NULL);
    if (score == 0) goto cleanup;  
    ...
cleanup:
    zfree(offenders);
}

As the code above shows, if the score is 0, the anti-affinity is already perfect and there is nothing to optimize; the function jumps straight to cleanup to free the offenders array.

If the score is non-zero, a maximum iteration count maxiter is set, proportional to the number of nodes, and a while loop then performs the optimization within that bounded number of iterations:

    ...
    int maxiter = 500 * node_len; // Effort is proportional to cluster size...
    while (maxiter > 0) {
    	...
        maxiter--;
    }
    ...

The heart of this function is the while loop, so let's walk through some of its fragments.

First, offending_len is declared to hold the number of offending nodes; if a previous iteration left an offenders array behind (offenders != NULL), it is freed with zfree(offenders).

The score is then recomputed; if it is 0, or there are no offending nodes, the while loop exits:

    int offending_len = 0;
    if (offenders != NULL) {
        zfree(offenders);  // free the offenders array saved by the previous iteration
        offenders = NULL;
    }
    score = clusterManagerGetAntiAffinityScore(ipnodes,
                                               ip_count,
                                               &offenders,
                                               &offending_len);
    if (score == 0 || offending_len == 0) break;

Next, one of the offending nodes is picked at random, and we will try to swap its assigned master:

        int rand_idx = rand() % offending_len;
        clusterManagerNode *first = offenders[rand_idx],
                           *second = NULL;

		// array to hold the other slaves whose master assignment could be swapped
        clusterManagerNode **other_replicas = zcalloc((node_len - 1) *
                                                      sizeof(*other_replicas));

The function then walks the cluster's nodes looking for other slaves whose master could be swapped in. If none are found, the loop exits:

    while ((ln = listNext(&li)) != NULL) {
        clusterManagerNode *n = ln->value;
        if (n != first && n->replicate != NULL)
            other_replicas[other_replicas_count++] = n;
    }
    
    if (other_replicas_count == 0) {
        zfree(other_replicas);
        break;
    }

If a candidate is found, the swap is performed and the post-swap anti-affinity score is computed:

    // randomly pick one of the candidate slaves as the swap partner
    rand_idx = rand() % other_replicas_count;
    second = other_replicas[rand_idx];
    
    // swap the master assignments of the two slaves
    char *first_master = first->replicate,
         *second_master = second->replicate;
    first->replicate = second_master, first->dirty = 1;
    second->replicate = first_master, second->dirty = 1;
    
    // compute the anti-affinity score after the swap
    int new_score = clusterManagerGetAntiAffinityScore(ipnodes,
                                                       ip_count,
                                                       NULL, NULL);

If the post-swap score is higher than before, the swap made things worse and would have been better left undone, so it is rolled back; if the score is no worse, the swap is kept, because the best solution may require a few combined swaps.

    if (new_score > score) {
        first->replicate = first_master;
        second->replicate = second_master;
    }

Finally, resources are released in preparation for the next iteration of the while loop:

    zfree(other_replicas);
    maxiter--;

To summarize:

  • Each iteration of the while loop randomly picks a slave that violates the anti-affinity rules, swaps its master assignment with that of another randomly chosen slave, and then recomputes the anti-affinity score
  • If the score went up, the swap hurt anti-affinity, so it is rolled back
  • If the score did not get worse, the swap is kept, since reaching the best layout may take several combined swaps
  • Through many such random swap attempts, the cluster converges to a layout with better anti-affinity, as the standalone sketch below illustrates
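
To make this concrete, here is a minimal, self-contained C sketch of the same randomized "swap, rescore, revert if worse" loop. It is not Redis code: the hosts, the toy scoring function, and the initial layout are all made up for illustration, with three masters and three slaves spread over three hosts.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MASTERS 3
#define SLAVES  3

static int master_ip[MASTERS] = {0, 1, 2};  /* master i lives on host master_ip[i] */
static int slave_ip[SLAVES]   = {0, 1, 2};  /* slave j lives on host slave_ip[j] */
static int slave_master[SLAVES];            /* slave j replicates master slave_master[j] */

/* Toy version of the scoring rules: 10000 for a slave on its master's host;
 * otherwise 1 if it shares a host with another slave of the same master. */
static int score(void)
{
    int s = 0;
    for (int j = 0; j < SLAVES; j++) {
        if (slave_ip[j] == master_ip[slave_master[j]]) { s += 10000; continue; }
        for (int k = 0; k < SLAVES; k++) {
            if (k != j && slave_master[k] == slave_master[j] &&
                slave_ip[k] == slave_ip[j]) { s += 1; break; }
        }
    }
    return s;
}

int main(void)
{
    srand((unsigned) time(NULL));
    /* Worst-case start: slave j sits on the same host as master j. */
    for (int j = 0; j < SLAVES; j++) slave_master[j] = j;
    int maxiter = 500 * (MASTERS + SLAVES); /* effort proportional to cluster size */
    while (maxiter-- > 0) {
        int cur = score();
        if (cur == 0) break;                /* perfect anti-affinity reached */
        int a = rand() % SLAVES, b = rand() % SLAVES;
        if (a == b) continue;
        int tmp = slave_master[a];          /* try swapping the two master assignments */
        slave_master[a] = slave_master[b];
        slave_master[b] = tmp;
        if (score() > cur) {                /* worse than before? revert the swap */
            slave_master[b] = slave_master[a];
            slave_master[a] = tmp;
        }
    }
    printf("final score: %d\n", score());
    for (int j = 0; j < SLAVES; j++)
        printf("slave %d (host %d) -> master %d (host %d)\n",
               j, slave_ip[j], slave_master[j], master_ip[slave_master[j]]);
    return 0;
}

On this toy layout the loop typically reaches a score of 0 after a handful of accepted swaps, leaving every slave replicating a master that lives on a different host, which is exactly the staggered placement described at the start of this post.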

What remains is housekeeping: printing a log message, freeing memory, and so on, which I won't go into here.

    char *msg;
    int perfect = (score == 0);
    int log_level = (perfect ? CLUSTER_MANAGER_LOG_LVL_SUCCESS :
                               CLUSTER_MANAGER_LOG_LVL_WARN);
    if (perfect) msg = "[OK] Perfect anti-affinity obtained!";
    else if (score >= 10000)
        msg = ("[WARNING] Some slaves are in the same host as their master");
    else
        msg=("[WARNING] Some slaves of the same master are in the same host");
    clusterManagerLog(log_level, "%s\n", msg);

Here is the complete function:

static void clusterManagerOptimizeAntiAffinity(clusterManagerNodeArray *ipnodes,
    int ip_count)
{
    clusterManagerNode **offenders = NULL;
    int score = clusterManagerGetAntiAffinityScore(ipnodes, ip_count,
                                                   NULL, NULL);
    if (score == 0) goto cleanup;
    clusterManagerLogInfo(">>> Trying to optimize slaves allocation "
                          "for anti-affinity\n");
    int node_len = cluster_manager.nodes->len;
    int maxiter = 500 * node_len; // Effort is proportional to cluster size...
    srand(time(NULL));
    while (maxiter > 0) {
        int offending_len = 0;
        if (offenders != NULL) {
            zfree(offenders);
            offenders = NULL;
        }
        score = clusterManagerGetAntiAffinityScore(ipnodes,
                                                   ip_count,
                                                   &offenders,
                                                   &offending_len);
        if (score == 0 || offending_len == 0) break; // Optimal anti affinity reached
        /* We'll try to randomly swap a slave's assigned master causing
         * an affinity problem with another random slave, to see if we
         * can improve the affinity. */
        int rand_idx = rand() % offending_len;
        clusterManagerNode *first = offenders[rand_idx],
                           *second = NULL;
        clusterManagerNode **other_replicas = zcalloc((node_len - 1) *
                                                      sizeof(*other_replicas));
        int other_replicas_count = 0;
        listIter li;
        listNode *ln;
        listRewind(cluster_manager.nodes, &li);
        while ((ln = listNext(&li)) != NULL) {
            clusterManagerNode *n = ln->value;
            if (n != first && n->replicate != NULL)
                other_replicas[other_replicas_count++] = n;
        }
        if (other_replicas_count == 0) {
            zfree(other_replicas);
            break;
        }
        rand_idx = rand() % other_replicas_count;
        second = other_replicas[rand_idx];
        char *first_master = first->replicate,
             *second_master = second->replicate;
        first->replicate = second_master, first->dirty = 1;
        second->replicate = first_master, second->dirty = 1;
        int new_score = clusterManagerGetAntiAffinityScore(ipnodes,
                                                           ip_count,
                                                           NULL, NULL);
        /* If the change actually makes thing worse, revert. Otherwise
         * leave as it is because the best solution may need a few
         * combined swaps. */
        if (new_score > score) {
            first->replicate = first_master;
            second->replicate = second_master;
        }
        zfree(other_replicas);
        maxiter--;
    }
    score = clusterManagerGetAntiAffinityScore(ipnodes, ip_count, NULL, NULL);
    char *msg;
    int perfect = (score == 0);
    int log_level = (perfect ? CLUSTER_MANAGER_LOG_LVL_SUCCESS :
                               CLUSTER_MANAGER_LOG_LVL_WARN);
    if (perfect) msg = "[OK] Perfect anti-affinity obtained!";
    else if (score >= 10000)
        msg = ("[WARNING] Some slaves are in the same host as their master");
    else
        msg=("[WARNING] Some slaves of the same master are in the same host");
    clusterManagerLog(log_level, "%s\n", msg);
cleanup:
    zfree(offenders);
}