[YARN Architecture and Implementation Made Simple] 2-1 Overview of the YARN Base Libraries

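Understanding YARN's base libraries is a prerequisite for reading the YARN source code later on. This section gives an overview of those libraries.
It also briefly introduces two of the third-party libraries involved, Protocol Buffers and Avro: what they are and how to use them.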

I. Main libraries used

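Protocol Buffers: Google's open-source serialization library. It is platform-neutral, fast, and handles compatibility well. YARN uses it for RPC communication: by default, all YARN RPC arguments are serialized and deserialized with Protocol Buffers.
Apache Avro: a serialization and RPC framework in the Hadoop ecosystem. It is platform-neutral and supports dynamic schemas (no code generation required). Avro was originally designed to address the poor compatibility and extensibility of Hadoop RPC.
RPC library: YARN still uses the RPC library from MRv1, but the default serialization method has been replaced with Protocol Buffers.
Service library and event library: YARN turns all objects into services so that they can be managed uniformly (for example, creation and destruction). Services communicate through an event mechanism instead of the direct function calls used in MRv1.
State machine library: YARN uses finite state machines to describe the states of certain objects and the transitions between those states. With the state machine model, YARN's code structure is clearer and easier to follow than MRv1's.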
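
To make the state machine idea concrete, here is a minimal sketch of an event-driven finite state machine in plain Java. It is not YARN's state machine library: the class name, the states (NEW, RUNNING, FINISHED, FAILED) and the events (START, COMPLETE, ERROR) are all hypothetical and chosen only for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

public class JobStateMachineSketch {
    // Hypothetical states and events, chosen only for illustration.
    public enum State { NEW, RUNNING, FINISHED, FAILED }
    public enum Event { START, COMPLETE, ERROR }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
            new EnumMap<>(State.class);

    static {
        addTransition(State.NEW, Event.START, State.RUNNING);
        addTransition(State.RUNNING, Event.COMPLETE, State.FINISHED);
        addTransition(State.RUNNING, Event.ERROR, State.FAILED);
    }

    private static void addTransition(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    // Handle one event: follow the registered transition, or reject the event.
    public State handle(Event event) {
        Map<Event, State> byEvent = TRANSITIONS.get(current);
        State next = (byEvent == null) ? null : byEvent.get(event);
        if (next == null) {
            throw new IllegalStateException(
                    "Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public State getCurrent() {
        return current;
    }

    public static void main(String[] args) {
        JobStateMachineSketch job = new JobStateMachineSketch();
        System.out.println(job.handle(Event.START));
        System.out.println(job.handle(Event.COMPLETE));
    }
}
```

Keeping every legal transition in one table is what makes this style easy to read: an object's whole lifecycle is visible in one place, and any event that is not registered for the current state fails loudly.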

II. Introduction to the third-party open-source libraries

(I) Protocol Buffers

1. Brief introduction

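Protocol Buffers is a language-neutral, platform-neutral mechanism, open-sourced by Google, for serializing structured data. Its small and efficient encoding and its compatibility-friendly design have made it widely used.
[It can be compared to Java's built-in Serializable mechanism: both turn objects into bytes and back.]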
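
For comparison, the following is a minimal sketch of the same round trip using only the JDK's built-in serialization. The Student class here is a hypothetical stand-in, not generated code. Unlike Protocol Buffers, this mechanism is Java-only and writes class metadata into the stream, so the output is usually larger and cannot be read from other languages.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class JavaSerializationSketch {
    // Hypothetical plain class standing in for a "student" record.
    public static class Student implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public final int id;

        public Student(String name, int id) {
            this.name = name;
            this.id = id;
        }
    }

    // Object -> bytes using the JDK's ObjectOutputStream.
    public static byte[] serialize(Student student) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(student);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Bytes -> object using the JDK's ObjectInputStream.
    public static Student deserialize(byte[] bytes) {
        try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Student) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new Student("San Zhang", 11111));
        Student copy = deserialize(bytes);
        System.out.println(copy.name + " " + copy.id + " (" + bytes.length + " bytes)");
    }
}
```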

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

Core features:

Language- and platform-neutral
Concise
High performance
Good compatibility

2. Environment setup

Using macOS as the example (for other platforms, look up the appropriate installation method).

# 1) Install with brew
brew install protobuf

# Check the install location
$ which protoc
/opt/homebrew/bin/protoc

# 2) Set the environment variable (add the export line below to ~/.zshrc)
vim ~/.zshrc

# protoc (for hadoop)
export PROTOC="/opt/homebrew/bin/protoc"

source ~/.zshrc

# 3) Check the protobuf version
$ protoc --version
libprotoc 3.19.1

3. Write a demo

1) Create a Maven project and add the dependency

<dependencies>
  <dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.19.1</version>  <!-- the version must match the installed protoc version -->
  </dependency>
</dependencies>

2) In the project root, create the protobuf message definition file student.proto

For the proto data type syntax, see: ProtoBuf getting-started tutorial (ProtoBuf 入門教學)

syntax = "proto3"; // declare this as a protobuf 3 definition file
package tutorial;

option java_package = "com.shuofxz.learning.student";  // package of the generated file
option java_outer_classname = "StudentProtos";         // name of the generated outer class

message Student {                   // the structured data being described
    string name = 1;
    int32 id = 2;
    optional string email = 3;      // optional: the field may be left unset

    message PhoneNumber {           // nested message
        string number = 1;
        optional int32 type = 2;
    }

    repeated PhoneNumber phone = 4; // repeated field
}

3) Use the protoc tool to generate the Java class for the message (run it in the directory that contains the proto file)

protoc -I=. --java_out=src/main/java student.proto

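The generated StudentProtos.java class appears under the corresponding package folder (src/main/java/com/shuofxz/learning/student); it contains the serialization and deserialization methods. The StudentExample class below uses it to write a Student to a file and read it back.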

import com.shuofxz.learning.student.StudentProtos;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class StudentExample {
    public static void main(String[] args) {
        // Build a Student message with two phone numbers
        StudentProtos.Student student1 = StudentProtos.Student.newBuilder()
                .setName("San Zhang")
                .setEmail("[email protected]")
                .setId(11111)
                .addPhone(StudentProtos.Student.PhoneNumber.newBuilder()
                        .setNumber("13911231231")
                        .setType(0))
                .addPhone(StudentProtos.Student.PhoneNumber.newBuilder()
                        .setNumber("01082345678")
                        .setType(1))
                .build();

        // Serialize to a file; try-with-resources closes the stream even on failure
        try (FileOutputStream output = new FileOutputStream("example.txt")) {
            student1.writeTo(output);
        } catch (IOException e) {
            System.out.println("Write Error! " + e.getMessage());
        }

        // Deserialize from the file
        try (FileInputStream input = new FileInputStream("example.txt")) {
            StudentProtos.Student student2 = StudentProtos.Student.parseFrom(input);
            System.out.println("Student2:" + student2);
        } catch (IOException e) {
            System.out.println("Read Error! " + e.getMessage());
        }
    }
}

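That is the complete workflow for using Protocol Buffers. There is nothing difficult about it: the code calls a third-party serialization library to serialize an object to a file and then deserializes it back.
The only extra step is that the data structure must first be defined in a proto file, from which the corresponding class is generated.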
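
To show what the field numbers in student.proto are for, here is a hand-rolled sketch of how the name and id fields are laid out in the Protocol Buffers wire format, using only the JDK. It assumes the documented encoding rules (each field starts with a tag equal to (field_number << 3) | wire_type, integers are base-128 varints, strings are length-prefixed UTF-8). The class and method names are hypothetical, and real code should use the generated StudentProtos class instead.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class WireFormatSketch {
    static final int WIRE_VARINT = 0;            // used for int32, int64, bool, ...
    static final int WIRE_LENGTH_DELIMITED = 2;  // used for string, bytes, nested messages

    // Base-128 varint: 7 payload bits per byte, high bit set on every byte but the last.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // A tag combines the field number from the .proto file with the wire type.
    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    // Encodes only `string name = 1` and `int32 id = 2` of the Student message.
    public static byte[] encodeStudent(String name, int id) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        writeTag(out, 1, WIRE_LENGTH_DELIMITED);
        writeVarint(out, nameBytes.length);
        out.write(nameBytes, 0, nameBytes.length);
        writeTag(out, 2, WIRE_VARINT);
        writeVarint(out, id);
        return out.toByteArray();
    }

    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 0A 09 53 61 6E 20 5A 68 61 6E 67 10 E7 56
        System.out.println(toHex(encodeStudent("San Zhang", 11111)));
    }
}
```

The whole record takes 14 bytes: one tag byte per field, a one-byte length, nine bytes of text, and two bytes for the id. Field names never appear in the output, which is why renaming a field is safe while renumbering one is not.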

4. Use in YARN

In YARN, the parameters of all RPC functions are defined with Protocol Buffers. The RPC layer itself is still the one from MRv1.

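(II) Apache Avro

1. Brief introduction

Apache Avro began as a subproject of Hadoop. It is a serialization framework that also implements RPC.
However, Avro was not yet mature when the YARN project started, so YARN uses it only as a log serialization library: all events are serialized with Avro.
Features: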

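Rich data structure types;
A fast, compact binary data format;
A container file format for storing persistent data;
Remote procedure calls (RPC);
Simple integration with dynamic languages.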

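Compared with Apache Thrift and Google's Protocol Buffers, Apache Avro has the following characteristics:

Dynamic schemas. Avro does not require code generation, which makes it easier to build generic data processing systems and avoids intrusive generated code.
Untagged data. The schema is available before the data is read, so Avro needs to keep less type information when encoding, which reduces the size of the serialized data.
No manually assigned field IDs. Thrift and Protocol Buffers identify each field by a user-assigned integer, whereas Avro uses the field name directly, which is more intuitive and easier to extend.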
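
The "untagged data" point can be sketched without the Avro library. In the example below the schema (an ordered map of field names to types) drives both the writer and the reader, so the bytes contain values only, with no tags or field names. The encoding follows my understanding of Avro's binary rules (zig-zag varints for integers, length-prefixed UTF-8 for strings), but the class, the Type enum and studentSchema() are hypothetical, and this is an illustration rather than a compatible Avro implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaDrivenEncodingSketch {
    public enum Type { INT, STRING }

    // Hypothetical schema: field order matters, because the data carries no tags.
    public static Map<String, Type> studentSchema() {
        Map<String, Type> schema = new LinkedHashMap<>();
        schema.put("name", Type.STRING);
        schema.put("id", Type.INT);
        return schema;
    }

    // Zig-zag maps small negative and positive numbers to small unsigned ones,
    // which are then written as a base-128 varint.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    static long readLong(ByteBuffer in) {
        long z = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Writer: walk the schema and emit each value in order, nothing else.
    public static byte[] encode(Map<String, Type> schema, Map<String, Object> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, Type> field : schema.entrySet()) {
            Object value = record.get(field.getKey());
            if (field.getValue() == Type.INT) {
                writeLong(out, (Integer) value);
            } else {
                byte[] utf8 = ((String) value).getBytes(StandardCharsets.UTF_8);
                writeLong(out, utf8.length);
                out.write(utf8, 0, utf8.length);
            }
        }
        return out.toByteArray();
    }

    // Reader: the same schema tells it which type to expect next and what to call it.
    public static Map<String, Object> decode(Map<String, Type> schema, byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, Type> field : schema.entrySet()) {
            if (field.getValue() == Type.INT) {
                record.put(field.getKey(), (int) readLong(in));
            } else {
                byte[] utf8 = new byte[(int) readLong(in)];
                in.get(utf8);
                record.put(field.getKey(), new String(utf8, StandardCharsets.UTF_8));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> student = new LinkedHashMap<>();
        student.put("name", "San Zhang");
        student.put("id", 11111);
        byte[] data = encode(studentSchema(), student);
        System.out.println(data.length + " bytes -> " + decode(studentSchema(), data));
    }
}
```

The trade-off is visible in decode: without the writer's schema the bytes cannot be interpreted at all, which is why Avro stores or exchanges the schema alongside the data.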

2. Environment setup & demo

See: Getting Started with Avro (Avro學習入門)

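3. Use in YARN

Apache Avro was originally an RPC framework tailored to Hadoop. For the sake of stability, YARN for now uses Protocol Buffers as its serialization library and keeps the RPC from MRv1, while Avro serves as the log serialization library. In YARN MapReduce, all event serialization and deserialization is done with Avro; the definitions are in the Events.avpr file.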

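III. Summary

This section briefly introduced five important base libraries in YARN. Knowing them helps in understanding YARN's code logic and the way it passes data around.
It also introduced two of the third-party open-source libraries: Protocol Buffers serializes and deserializes RPC function arguments, and Avro is the serialization library for logs and events.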