不要使用短路邏輯編寫 stl sorter 多條件比較

前言

最近工期緊、任務多，沒有時間更新部落格，就水一期吧。雖然是水，也不能太水，剛好最近工作中遇到一個 sorter 多條件排序的問題，花費了半天時間來定位解決，就說說它吧。

背景

公司產品是一個跨端的資料傳輸 sdk，當下載資源時，會先從伺服器拉取一批 peer，每個 peer 是包含要下載資源分片的 p2p 端，每個節點有一個序號，sdk 根據這個值從小到大排序後依次選取 peer 進行連線，序號是由伺服器決定的，主要根據地域、連通率、頻寬等決定的，可以簡化為下面的 demo：

#include <stdio.h> 
#include <vector>
#include <string>
#include <algorithm>

struct PeerInfo
{
    std::string peerid; 
    std::string ip; 
    short port; 
    int begin; 
    int end; 
    int seq; 

    void print(void); 
};

void PeerInfo::print(void)
{
    printf ("peer %s: %d, %s:%d, %d-%d\n", 
            peerid.c_str(), seq, 
            ip.c_str(), port, begin, end); 
}

struct PeerInfoSorter
{
    bool operator() (PeerInfo const& lhs, PeerInfo const& rhs) { 
        return lhs.seq < rhs.seq; 
    }
};

int main (void)
{
    std::vector<PeerInfo> vpi; 
    // init this vector by server response
    // ...
    std::sort (vpi.begin(), vpi.end(), PeerInfoSorter()); 
    for (auto it = vpi.begin(); it != vpi.end(); ++ it)
    {
        it->print(); 
    }
}

在下載過程中會向伺服器請求多次，每批返回的 peer 都放在一個容器中排序，但是序號是基於批的，多個批次之間的 peer 如果僅按照 seq 排序的話，就會將前面批次舊的 peer 排在前面，從而導致一些新 peer 沒有機會被用到，發生飢餓現象。

問題的產生

瞭解到這個情況後，採取了按批和序號同時排序的方案，即為 peer 增加一個 batch 欄位用於記錄批號，在排序時只有 batch 相同時才去比較 seq，程式碼類似下面這樣：

struct PeerInfo
{
    std::string peerid; 
    std::string ip; 
    short port; 
    int begin; 
    int end; 
    int seq; 
    int batch; 

    void print(void); 
};

void PeerInfo::print(void)
{
    printf ("peer %s: %d-%d, %s:%d, %d-%d\n", 
            peerid.c_str(), batch, seq, 
            ip.c_str(), port, begin, end); 
}

struct PeerInfoSorter
{
    bool operator() (PeerInfo const& lhs, PeerInfo const& rhs) { 
        return lhs.batch < rhs.batch || lhs.seq < rhs.seq; 
    }
};

當時的想法比較直接，先比較 batch，如果 batch 已經小了就直接短路返回結果；再比較 seq。看著邏輯沒什麼問題，但是執行起來後發現還是有舊批次的 peer 排在前面，以上面那個 demo 為例，製造一些測試資料：

    // ...
    vpi.push_back ({"1a2b", "10.0.1.29", 8001, 0, 16384, 1, 1}); 
    vpi.push_back ({"3c4d", "10.0.1.30", 8002, 16384, 32768, 2, 1}); 
    vpi.push_back ({"5e6f", "10.0.1.31", 8003, 8192, 24576, 1, 2}); 
    vpi.push_back ({"7a8b", "10.0.5.22", 8081, 32768, 49152, 2, 2}); 
    vpi.push_back ({"9c0d", "10.0.5.23", 8082, 49152, 65536, 1, 3}); 
    vpi.push_back ({"afec", "10.0.5.24", 8083, 0, 65536, 2, 3});

其中最後兩個欄位分別是 seq 與 batch 的初始值。執行後輸出如下：

$ ./peer
peer 1a2b: 1-1, 10.0.1.29:8001, 0-16384
peer 5e6f: 2-1, 10.0.1.31:8003, 8192-24576
peer 9c0d: 3-1, 10.0.5.23:8082, 49152-65536
peer 3c4d: 1-2, 10.0.1.30:8002, 16384-32768
peer 7a8b: 2-2, 10.0.5.22:8081, 32768-49152
peer afec: 3-2, 10.0.5.24:8083, 0-65536

peer 1-2 排在 2-1 後面明顯不是我們想要的，那是哪裡出了問題呢？

問題的解決

看起來是 sorter 寫的有問題，重新考察一下它的邏輯：

lhs.batch < rhs.batch 時，直接返回 true 並短路後面的條件，這是正確的
lhs.batch = rhs.batch 時，結果退化為 seq 之間的比較，也是正確的
lhs.batch > rhs.batch 時，結果退化為 seq 之間的比較，問題出在這裡，此時應當直接返回 false

因此 sorter 正確的寫法應該是這樣：

struct PeerInfoSorter
{
    bool operator() (PeerInfo const& lhs, PeerInfo const& rhs) { 
        if (lhs.batch != rhs.batch)
           return lhs.batch < rhs.batch; 

        return lhs.seq < rhs.seq; 
    }
};

前面的條件只要不相等就直接短路了，更正後輸出正常了：

$ ./peer
peer 1a2b: 1-1, 10.0.1.29:8001, 0-16384
peer 3c4d: 1-2, 10.0.1.30:8002, 16384-32768
peer 5e6f: 2-1, 10.0.1.31:8003, 8192-24576
peer 7a8b: 2-2, 10.0.5.22:8081, 32768-49152
peer 9c0d: 3-1, 10.0.5.23:8082, 49152-65536
peer afec: 3-2, 10.0.5.24:8083, 0-65536

現在回過頭來看前面錯誤的程式碼，看上去它至少保證了 batch 的順序，那麼這是真的嗎？稍微調整一下容器資料的初始順序：

    vpi.push_back ({"9c0d", "10.0.5.23", 8082, 49152, 65536, 1, 3}); 
    vpi.push_back ({"afec", "10.0.5.24", 8083, 0, 65536, 2, 3}); 
    vpi.push_back ({"1a2b", "10.0.1.29", 8001, 0, 16384, 1, 1}); 
    vpi.push_back ({"3c4d", "10.0.1.30", 8002, 16384, 32768, 2, 1}); 
    vpi.push_back ({"5e6f", "10.0.1.31", 8003, 8192, 24576, 1, 2}); 
    vpi.push_back ({"7a8b", "10.0.5.22", 8081, 32768, 49152, 2, 2});

得到的輸出打破了這一"幻覺"：

$ ./peer
peer 1a2b: 1-1, 10.0.1.29:8001, 0-16384
peer 5e6f: 2-1, 10.0.1.31:8003, 8192-24576
peer 3c4d: 1-2, 10.0.1.30:8002, 16384-32768
peer 7a8b: 2-2, 10.0.5.22:8081, 32768-49152
peer 9c0d: 3-1, 10.0.5.23:8082, 49152-65536
peer afec: 3-2, 10.0.5.24:8083, 0-65536

很顯然 1-2 排在了 2-1 之後。再分析之前的邏輯，當 lhs.batch > rhs.batch 時，結果是由 seq 決定的，所以完全可能出現上面的場景。而到底對哪些元素進行對比完全是由輸入序列和對比演演算法決定的，怎麼構造反例還真不好設計，只有當資料量大時才會表現的比較明顯。

總結

再回頭來看邏輯短路操作，如果寫成下面形式：

struct PeerInfoSorter
{
    bool operator() (PeerInfo const& lhs, PeerInfo const& rhs) { 
        return lhs.batch < rhs.batch || lhs.seq < rhs.seq; 
    }
};

則當 lhs.batch > rhs.batch 時不會短路，從而引發錯誤。如果寫成下面的形式：

struct PeerInfoSorter
{
    bool operator() (PeerInfo const& lhs, PeerInfo const& rhs) { 
        return lhs.batch < rhs.batch && lhs.seq < rhs.seq; 
    }
};

則當 lhs.batch < rhs.batch 時不會短路，也引發錯誤。

總結一下就是，我們需要返回 batch 或 seq 的 operator < 結果來作為比較結果，但是這個條件對於 || 和 && 在一半的情況下是不會短路的，具體而言就是：

使用 || 邏輯短路時，lhs.batch < rhs.batch 得到滿足，lhs.batch > rhs.batch 沒有得到滿足
使用 && 邏輯短路時， lhs.batch > rhs.batch 得到滿足，lhs.batch < rhs.batch 沒有得到滿足

那它們能得到全部滿足嗎？當短路發生時，lhs.batch < rhs.batch 這一條件有 true 和 false 兩種情況需要返回，而短路邏輯 || 和 && 只能允許其中一種通過，所以答案是不能。除非可以預判只有其中一種條件發生 (只返回 true 或 false)，然後有針對性的寫 || 或 && 語句，不過那樣的話這個欄位也就沒有參與比較的意義了。

最終結論就是，不要使用短路邏輯處理 sorter 多條件之間的判斷。

後記

回到前面專案中，再給它加一點需求：伺服器返回不同批次的 peer 可能重複，通過 peerid 可以去重，但當新 batch 中的 peer 在之前出現並連線過時，我們希望它的優先順序變低，以保證未連線過的 peer 不發生飢餓現象。對於這樣的需求，怎麼改呢？想必大家心中已經有了答案，現在和正確答案對比一下吧：

struct PeerInfoSorter
{
    bool operator() (PeerInfo const& lhs, PeerInfo const& rhs) { 
        if (lhs.connect_cnt != rhs.connect_cnt)
            return lhs.connect_cnt < rhs.connect_cnt;

        if (lhs.batch != rhs.batch)
           return lhs.batch < rhs.batch; 

        return lhs.seq < rhs.seq; 
    }
};

其中 connect_cnt 新欄位表示連線的次數，每連線一次增加一個計數，將這個條件放在最前面以便保證連線次數最少的 peer 排在最前面，只有當連線次數相同時 (新 peer 的 connect_cnt == 0)，才對比 batch 與 seq。

舉這個例子的目的是為了說明，sorter 多條件對比時，只要按重要性一個個排就可以了，你學會了嗎？