A Brief Analysis of Join Query Design and Implementation in Distributed Databases

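Compared with queries against a single-instance database, queries over distributed data raise a number of additional technical challenges.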

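This article records how Join queries are implemented for MySQL database and table sharding and for Elasticsearch, as a way to understand how data processing is designed in distributed scenarios.
It starts with Join under sharding for the widely used relational database MySQL, then moves to the non-relational Elasticsearch and its Join strategies, working step by step toward the underlying Join mechanisms.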

① MySQL Sharded Join Query Scenario

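With sharded databases and tables, the questions are how a query statement is dispatched and how the resulting data is assembled. Compared with NoSQL databases, MySQL is relatively easy to adapt to distributed scenarios as long as queries stay within the SQL standard.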

The design is examined here through a solution based on the sharding-jdbc middleware.

sharding-jdbc

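sharding-jdbc proxies the original DataSource and implements the JDBC specification to dispatch queries across shards and assemble the results, so the application layer is unaware of the sharding.
Execution flow: SQL parsing => executor optimization => SQL routing => SQL rewriting => SQL execution => result merging. See io.shardingsphere.core.executor.ExecutorEngine#execute
Parsing the Join statement determines which instance nodes the SQL must be dispatched to. This corresponds to the SQL routing step.
SQL rewriting replaces the original (logical) table names with the names of the actual sharded tables.
In the worst case, the maximum number of executions dispatched for a Join query = number of database instances × number of shards of table A × number of shards of table B.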
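The routing and rewriting steps above can be sketched in a few lines of Java. This is only an illustration of the idea, not sharding-jdbc's implementation: the class `CartesianRouteSketch`, its `route` and `rewrite` methods, and the `_<index>` suffix naming are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: enumerate every (data source, shard of A, shard of B)
// combination and rewrite the logical table names into actual shard names.
class CartesianRouteSketch {

    // Replace a logical table name with an actual table name, whole words only,
    // so that "order" does not also match inside "order_item" or "order_id".
    static String rewrite(String sql, String logicalTable, String actualTable) {
        String pattern = "\\b" + Pattern.quote(logicalTable) + "\\b";
        return sql.replaceAll(pattern, Matcher.quoteReplacement(actualTable));
    }

    // Worst case: no sharding value is known, so every combination is produced.
    static List<String> route(String logicalSql, List<String> dataSources,
                              String tableA, int shardsA, String tableB, int shardsB) {
        List<String> units = new ArrayList<>();
        for (String ds : dataSources) {
            for (int a = 0; a < shardsA; a++) {
                for (int b = 0; b < shardsB; b++) {
                    String sql = rewrite(logicalSql, tableA, tableA + "_" + a);
                    sql = rewrite(sql, tableB, tableB + "_" + b);
                    units.add(ds + " ::: " + sql);
                }
            }
        }
        return units;
    }

    public static void main(String[] args) {
        List<String> units = route(
                "select * from order o inner join order_item oi on o.order_id = oi.order_id",
                Arrays.asList("db0", "db1"), "order", 2, "order_item", 2);
        units.forEach(System.out::println);
        // 2 data sources x 2 shards x 2 shards
        System.out.println("units: " + units.size());
    }
}
```

With two data sources and two shards per table, the sketch yields 2 × 2 × 2 = 8 statements, which is the upper bound given by the formula.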

Code Insight

Sample code project: [email protected]:cluoHeadon/sharding-jdbc-demo.git

/**
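 * Entry point for executing a query SQL; the whole execution flow can be debugged from here.
 * @see ShardingPreparedStatement#execute()
 * @see ParsingSQLRouter#route(String, List, SQLStatement) The tables a Join query actually touches are determined by matching against the routing rules here.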
 */
public boolean execute() throws SQLException {
    try {
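        // Match the actual tables involved, based on the parameters (which decide the shard) and the SQL itself.
        Collection<PreparedStatementUnit> preparedStatementUnits = route();
        // Dispatch execution on a thread pool, then merge the results.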
        return new PreparedStatementExecutor(getConnection().getShardingContext().getExecutorEngine(), routeResult.getSqlStatement().getType(), preparedStatementUnits).execute();
    } finally {
        JDBCShardingRefreshHandler.build(routeResult, connection).execute();
        clearBatch();
    }
}

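SQL Routing Strategies

Enable SQL logging to see the SQL that is actually dispatched and executed.

# Logged right after route() above produces the ExecutionUnits
sharding.jdbc.config.sharding.props.sql.show=true

sharding-jdbc applies different routing strategies depending on the SQL statement. For Join queries, the two relevant ones are:

StandardRoutingEngine: binding-tables mode.
ComplexRoutingEngine: the most complex case, a Cartesian combination of the joined tables' shards.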

-- No sharding parameter is given, so the shard cannot be located
select * from order o inner join order_item oi on o.order_id = oi.order_id

-- Routing result
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id
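For contrast with the eight statements above, the sketch below compares Cartesian routing with what binding-tables mode is meant to achieve. It assumes both tables are sharded by the same key into the same number of shards, so shard i of one table only needs to be joined with shard i of the other. The class and method names are hypothetical and do not come from sharding-jdbc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch contrasting the two routing outcomes for a two-table Join.
class BindingTableRouteSketch {

    // Cartesian routing: every shard of A is paired with every shard of B.
    static List<String> routeCartesian(List<String> dataSources,
                                       String tableA, String tableB, int shards) {
        List<String> pairs = new ArrayList<>();
        for (String ds : dataSources) {
            for (int a = 0; a < shards; a++) {
                for (int b = 0; b < shards; b++) {
                    pairs.add(ds + " ::: " + tableA + "_" + a + " JOIN " + tableB + "_" + b);
                }
            }
        }
        return pairs;
    }

    // Binding-table routing (assumption: identical sharding rules on both tables):
    // only shards with the same index are paired.
    static List<String> routeBinding(List<String> dataSources,
                                     String tableA, String tableB, int shards) {
        List<String> pairs = new ArrayList<>();
        for (String ds : dataSources) {
            for (int i = 0; i < shards; i++) {
                pairs.add(ds + " ::: " + tableA + "_" + i + " JOIN " + tableB + "_" + i);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> dbs = Arrays.asList("db0", "db1");
        System.out.println("cartesian: " + routeCartesian(dbs, "order", "order_item", 2).size());
        System.out.println("binding:   " + routeBinding(dbs, "order", "order_item", 2).size());
    }
}
```

Under these assumptions the number of dispatched statements grows with the shard count rather than with its square, which is why a binding relationship is worth declaring when the sharding rules allow it.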

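② Elasticsearch Join Query Scenario

First, when a NoSQL database is asked to perform Join queries, it is worth asking whether the use case or the usage itself is a poor fit.

That said, some scenarios unavoidably need this capability. In that case, implementing the Join query comes closer to what a SQL engine does.

The general approach is examined here through a solution based on the elasticsearch-sql component.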

elasticsearch-sql

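elasticsearch-sql is an Elasticsearch plugin that offers SQL-like queries through an HTTP service; newer versions of Elasticsearch already provide this capability natively.
Because Elasticsearch has no native SQL-style Join query feature, implementing SQL Join requires lower-level functionality, which involves Join algorithms.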

Code Insight

Source code: [email protected]:NLPchina/elasticsearch-sql.git

/**
 * Execute the ActionRequest and returns the REST response using the channel.
 * @see ElasticDefaultRestExecutor#execute
 * @see ESJoinQueryActionFactory#createJoinAction Join algorithm selection
 */
@Override
public void execute(Client client, Map<String, String> params, QueryAction queryAction, RestChannel channel) throws Exception{
    // sql parse
    SqlElasticRequestBuilder requestBuilder = queryAction.explain();

    // Join query
    if(requestBuilder instanceof JoinRequestBuilder){
        // Join algorithm selection: HashJoinElasticExecutor or NestedLoopsElasticExecutor.
        // If the join condition is an equality (Condition.OPEAR.EQ), HashJoinElasticExecutor is used.
        ElasticJoinExecutor executor = ElasticJoinExecutor.createJoinExecutor(client,requestBuilder);
        executor.run();
        executor.sendResponse(channel);
    }
    // Other query types ...
}
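The selection rule mentioned in the comments can be illustrated with a small stand-alone sketch. It is not the plugin's code: the `Operator` enum stands in for `Condition.OPEAR`, and treating "every join condition is an equality" as the trigger for hash join is an assumption made for the example. The reasoning behind such a rule is that a hash lookup can only find exact key matches, so any range or inequality condition has to fall back to comparing rows pairwise.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of choosing a Join algorithm from the join conditions.
class JoinExecutorChoiceSketch {

    // Stand-in for the plugin's condition operators (names are illustrative).
    enum Operator { EQ, GT, LT, NOT_EQ }

    enum JoinAlgorithm { HASH_JOIN, NESTED_LOOPS }

    // Assumption: hash join is chosen only when there is at least one condition
    // and all of them are equalities; everything else uses nested loops.
    static JoinAlgorithm choose(List<Operator> joinConditions) {
        boolean allEqualities = !joinConditions.isEmpty()
                && joinConditions.stream().allMatch(op -> op == Operator.EQ);
        return allEqualities ? JoinAlgorithm.HASH_JOIN : JoinAlgorithm.NESTED_LOOPS;
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList(Operator.EQ)));
        System.out.println(choose(Arrays.asList(Operator.EQ, Operator.GT)));
    }
}
```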

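③ More Than Join

Join Algorithms

The three common Join algorithms are Nested Loop Join (NLJ), Hash Join, and Merge Join.
Before version 8.0.18, MySQL supported only NLJ and its variants; Hash Join is supported from 8.0.18 onward.
NLJ amounts to two nested loops: the first table drives the outer loop and the second table the inner loop. Each row of the outer loop is compared with the rows of the inner loop, and every pair that satisfies the join condition is added to the result.
Hash Join has two phases: a build phase and a probe phase. The build phase loads one input (usually the smaller one) into an in-memory hash table keyed on the join column, and the probe phase scans the other input and looks up matching rows in that hash table.
EXPLAIN shows which Join algorithm MySQL uses. This requires the FORMAT=JSON or FORMAT=TREE syntax.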

EXPLAIN FORMAT=JSON  
SELECT * FROM
    sale_line_info u
    JOIN sale_line_manager o ON u.sale_line_code = o.sale_line_code;

{
    "query_block": {
        "select_id": 1,
        // Join algorithm used: nested_loop
        "nested_loop": [
            // Tables involved in the join and their keys; the other fields are similar to regular EXPLAIN output
            {
                "table": {
                    "table_name": "o",
                    "access_type": "ALL"
                }
            },
            {
                "table": {
                    "table_name": "u",
                    "access_type": "ref"
                }
            }
        ]
    }
}
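To make the Nested Loop Join and Hash Join descriptions in this section concrete, here is a minimal in-memory sketch of both for an equality join. It is not how MySQL or elasticsearch-sql implement them. Rows are modeled as two-element arrays `{joinKey, payload}`, and the class name, method names, and sample values are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each row is a String[] of {joinKey, payload}.
class JoinAlgorithmsSketch {

    // Nested Loop Join: compare every outer row with every inner row.
    static List<String> nestedLoopJoin(List<String[]> outer, List<String[]> inner) {
        List<String> result = new ArrayList<>();
        for (String[] o : outer) {
            for (String[] i : inner) {
                if (o[0].equals(i[0])) {
                    result.add(o[1] + "|" + i[1]);
                }
            }
        }
        return result;
    }

    // Hash Join: build a hash table on one input, then probe it with the other.
    static List<String> hashJoin(List<String[]> build, List<String[]> probe) {
        // Build phase: group the build-side rows by join key.
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] b : build) {
            table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b);
        }
        // Probe phase: one hash lookup per probe-side row.
        List<String> result = new ArrayList<>();
        for (String[] p : probe) {
            for (String[] b : table.getOrDefault(p[0], Collections.<String[]>emptyList())) {
                result.add(b[1] + "|" + p[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> lines = Arrays.asList(
                new String[]{"A", "line-A"}, new String[]{"B", "line-B"});
        List<String[]> managers = Arrays.asList(
                new String[]{"A", "mgr-1"}, new String[]{"B", "mgr-2"}, new String[]{"A", "mgr-3"});
        System.out.println(nestedLoopJoin(lines, managers));
        System.out.println(hashJoin(lines, managers));
    }
}
```

Both methods return the same set of joined rows, possibly in a different order. The nested loop performs one comparison per pair of rows, while the hash join reads each input once plus one lookup per probe row, which is why it tends to pay off for equality joins when no usable index exists.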

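Elasticsearch Nested Type

Depending on the business data held in Elasticsearch and how it is queried, another option is to store the related information directly inside the document. Elasticsearch then serves queries and retrieval over complete documents, which avoids Join-related techniques entirely.

This raises several questions: whether the related data is owned by the parent record or shared across records, how large the related data is, and how often it is updated. All of these need to be weighed before using the Nested type.

More usage details are available online and in the official documentation, so they are not repeated here.
One of our current business features happens to use the Nested type, and it solved a very difficult problem during query work and optimization.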

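Summary

Analyzing how execution works gives a clear, in-depth understanding of the execution flow.

It makes middleware optimization and technology selection more purposeful, and encourages more careful use.

Explicit filter conditions, a narrower filter range, and LIMIT on the rows fetched all reduce computation cost and improve performance.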

References

Author: Yang Pan, JD Logistics

Source: JD Cloud Developer Community