1.概述

本篇部落格將對MySQL、InfluxDB、Clickhouse在寫入時間、聚合查詢時間、磁碟使用等方面的效能指標來進行比較。

2.內容

比較的資料集，是使用的Clickhouse官網提供的6600萬的資料集來進行測試比較的，當MySQL、InfluxDB、Clickhouse也分配4CPU和16GB記憶體的資源時，Clickhouse完全是在匯入速度、磁碟使用和查詢效能等方面體現非常好的效果。結論如下所示：

	MySQL	InfluxDB	Clickhouse
匯入時間	70分鐘	35分鐘	70秒
磁碟佔用空間	12.35GB	5.9GB	2.66GB
全表count	24366ms	11674ms	100ms
全表max/min	27023ms	26829ms	186ms
全表求平均值	24841ms	12043ms	123ms
全表求方差	24600ms	OOM	113ms
複雜查詢1	30260ms	OOM	385ms
複雜查詢2	470ms	200ms	8ms

為了保證測試結果比較準確，上面的每條SQL至少執行10次，然後取中間值。其中InfluxDB的效能比預期差，甚至比MySQL差，這可能是由於資料樣本和測試用例不合適InfluxDB場景造成的。

2.1 MySQL

MySQL是Oracle的產品，是目前最流行的關係型資料庫管理系統之一。它使用的SQL語言是用於存取資料庫的最常見的標準化語言。採用雙授權策略，分為社群版和商業版。由於體積小、速度快、總體擁有成本低，尤其是其開源特性，一般選擇MySQL作為中小型網站開發的網站資料庫。

MySQL並不完美，但是它足夠靈活，是架構中的最佳選擇之一，並且在複雜的非單一專案中總能擁有一席之地。

2.2 InfluxDB

InfluxDB是InfluxData開發的開源時序資料庫，專注於海量時序資料的高效能讀、高效能寫、高效儲存和實時分析。在資料庫引擎排名時序資料庫中，它位居第一，廣泛應用與開發運維監控、物聯網監控、實時分析等場景。

傳統資料庫通常記錄資料的當前值，而時序資料庫記錄所有歷史資料。在處理當前時序資料時，必須不斷接收新的時序資料。同時，時序資料的查詢始終是基於時間的。它重點解決以下海量資料場景：

時序資料的寫入：如何支援每條千萬條資料的寫入；
時序資料的讀取：如何支援每條千萬條資料的聚合查詢；
成本問題：海量資料儲存帶來的成本問題，如何以更低的成本儲存這些資料。

2.3 Clickhouse

Clickhouse是由Yandex開源的基於列儲存的資料庫，用於實時資料分析，其處理資料的速度比傳統方法快100~1000倍。Clickhouse優於當前市場上類似的面向列的DBMS，每臺伺服器每秒處理數億到超過10億行和超過10GB的資料。

它是一個用於線上分析（OLAP）的列式資料庫管理系統（DBMS），對OLTP和OLAP的做如下區別介紹：

OLTP：它是一個傳統的關係型資料庫，主要操作增刪改查，強調交易一致性，比如銀行系統、電子商務系統等；
OLAP：是一個倉庫型的資料庫，主要用於讀取資料，做複雜的資料分析，專注於技術決策支援，提供直觀簡單的結果。

Clickhouse用於OLAP的適用場景如下：

讀取多於寫入；
一個大的寬表讀取大量的行和很少的列，同時匯入較小的結果集；
資料時分批寫入的，資料不更新或者更新頻率很低；
無需事務，資料一致性要求較低；
靈活多變，不適合模型預構建（類似Kylin的Cube）

3.測試資料準備

直接使用Clickhouse提供的測試資料地址：https://clickhouse.com/docs/en/getting-started/example-datasets/opensky/，這個資料集中的資料是從完整的OpenSky資料集中匯出和清洗過的。

該資料集涵蓋了自2019年1月1日以來，該網路2500多名成員看到的所有航班資訊。

3.1 下載資料

執行如下命令：

wget -O- https://zenodo.org/record/5092942 | grep -oP 'https://zenodo.org/record/5092942/files/flightlist_\d+_\d+\.csv\.gz' | xargs wget

在網際網路連線良好的情況下，下載大約需要 2 分鐘。共有 30 個檔案，總大小為 4.3 GB。

3.2 建立表Clickhouse

執行如下SQL命令：

CREATE TABLE opensky
(
    callsign String,
    number String,
    icao24 String,
    registration String,
    typecode String,
    origin String,
    destination String,
    firstseen DateTime,
    lastseen DateTime,
    day DateTime,
    latitude_1 Float64,
    longitude_1 Float64,
    altitude_1 Float64,
    latitude_2 Float64,
    longitude_2 Float64,
    altitude_2 Float64
) ENGINE = MergeTree ORDER BY (origin, destination, callsign);

3.2.1 匯入資料

將資料匯入到Clickhouse中，執行如下所示命令：

ls -1 flightlist_*.csv.gz | xargs -P100 -I{} bash -c 'gzip -c -d "{}" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO opensky FORMAT CSVWithNames"'

在這裡，我們將檔案列表 ( ls -1 flightlist_*.csv.gz)傳遞xargs給以進行並行處理。 xargs -P100指定最多使用 100 個並行工作器，但由於我們只有 30 個檔案，工作器的數量將只有 30 個；
對於每個檔案，xargs將執行一個帶有bash -c. 該指令碼以 of 的形式進行替換，{}並且該xargs命令將檔名替換為它（我們已經要求它xargs使用-I{}）；
該指令碼會將檔案 ( gzip -c -d "{}") 解壓縮到標準輸出 (-c引數) 並將輸出重定向到clickhouse-client
我們還要求使用擴充套件解析器 ( --date_time_input_format best_effort ) 解析DateTime欄位，以識別具有時區偏移的 ISO-8601 格式

最後，clickhouse-client會做插入。它將以CSVWithNames格式讀取輸入資料。並行上傳需要 24 秒。如果不想使用並行上傳，還可以使用順序上傳，可能需要的時間長一點，大概 75 秒，具體執行命令如下：

for file in flightlist_*.csv.gz; do gzip -c -d "$file" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO opensky FORMAT CSVWithNames"; done

完整程式碼如下：

$ clickhouse-client
$ create database test;
$ use test;
$ CREATE TABLE opensky(callsign String,number String,icao24 String,registration String,typecode String,origin String,destination String,firstseen DateTime,lastseen DateTime,day DateTime,latitude_1 Float64,longitude_1 Float64,altitude_1 Float64,latitude_2 Float64,longitude_2 Float64,altitude_2 Float64) ENGINE = MergeTree ORDER BY (origin, destination, callsign);
$ exit

# Import data (about 75 seconds)
$ cd /tmp/flightlist
$ for file in flightlist_*.csv; do cat "$file" | clickhouse-client --date_time_input_format best_effort --query "INSERT INTO test.opensky FORMAT CSVWithNames"; done

# Check if the data was imported successfully
$ clickhouse-client
$ SELECT count() FROM test.opensky;

3.3 建立MySQL

完整程式碼如下：

# Link MySQL to build database and table
$ mysql -uroot -p123456
$ use test;
$ CREATE TABLE `opensky` (`callsign` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`number` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`icao24` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`registration` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`typecode` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`origin` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`destination` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,`firstseen` datetime DEFAULT NULL,`lastseen` datetime DEFAULT NULL,`day` datetime DEFAULT NULL,`latitude_1` double DEFAULT NULL,`longitude_1` double DEFAULT NULL,`altitude_1` double DEFAULT NULL,`latitude_2` double DEFAULT NULL,`longitude_2` double DEFAULT NULL,`altitude_2` double DEFAULT NULL,KEY `idx_callsign` (`callsign`),KEY `idx_origin` (`origin`),KEY `idx_destination` (`destination`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

# Import data (about 70 minutes)
$ load data local infile 'flightlist_20190101_20190131.csv' into table opensky character set utf8mb4 fields terminated by ',' lines terminated by '\n' ignore 1 lines;
# Omit the other 29 import commands：load data local infile 'flightlist_*_*.csv' into table opensky character set utf8mb4 fields terminated by ',' lines terminated by '\n' ignore 1 lines;

# Check if the data was imported successfully
$ select count(*) from test.opensky;

3.4 建立InfluxDB

完整程式碼如下：

# Import data (about 30 minutes)
$ influx -username 'admin' -password 'admin123456' -import -path=/tmp/flightlist/flightlist_20190101_20190131.txt -precision=ns;
# Omit the other 29 import commands：influx -username 'admin' -password 'admin123456' -import -path=/tmp/flightlist/flightlist_*_*.txt -precision=ns;

# Check if the data was imported successfully
$ influx -username 'admin' -password 'admin123456'
$ select count(latitude_1) from test.autogen.opensky;

4.測試場景

4.1 MySQL測試維度

從不同的維度，對MySQL來進行測試，具體實現程式碼如下所示：

$ mysql -uroot -p123456
$ use test;
-- Enable performance analysis
set profiling = 1;
-- query disk space
select table_rows as `total_lines`, (data_length + index_length)/1024/1024/1024 as `disk_usage(G)` from information_schema.`TABLES` where table_name = 'opensky';
-- full table count
select count(latitude_1) from opensky;
-- full table max/min
select max(longitude_1),min(altitude_1) from opensky;
-- full table average
select avg(latitude_2) from opensky;
-- full table variance
select var_pop(longitude_2) from opensky;
-- Complex query 1: Aggregate query of multiple fields in the whole table
select count(latitude_1),max(longitude_1),min(altitude_1),avg(latitude_2) from opensky;
-- Complex query 2: Number of flights departing from the three main Moscow airports
SELECT origin, count(1) AS c FROM opensky WHERE origin IN ('UUEE', 'UUDD', 'UUWW') GROUP BY origin;
-- output analysis results
show profiles;

4.2 InfluxDB測試維度

從不同的維度，對InfluxDB來進行測試，具體實現程式碼如下所示：

$ influx -username 'admin' -password 'admin123456'
$ use test;
-- Time-consuming statistics，queryReqDurationNs is the cumulative query time, and the subtraction of the time of the two tasks is the time-consuming
select queryReq,queryReqDurationNs/1000/1000,queryRespBytes from _internal."monitor".httpd order by time desc limit 10;
-- query disk space
select sum(diskBytes) / 1024 / 1024 /1024 from _internal."monitor"."shard" where time > now() - 10s group by "database";
-- full table count
select count(latitude_1) from opensky;
-- full table max/min
select max(longitude_1),min(altitude_1) from opensky;
-- full table average
select mean(latitude_2) from opensky;
-- full table variance
select stddev(longitude_2) from opensky;
-- Complex query 1: Aggregate query of multiple fields in the whole table
select count(latitude_1),max(longitude_1),min(altitude_1),mean(latitude_2) from opensky;
-- Complex query 2: Number of flights departing from the three main Moscow airports
SELECT count(latitude_1) AS c FROM opensky WHERE origin =~/^UUEE|UUDD|UUWW$/ GROUP BY origin;

4.3 Clickhouse測試維度

從不同的維度，對Clickhouse來進行測試，具體實現程式碼如下所示：

$ clickhouse-client
$ use test;
-- Time-consuming statistics
select event_time_microseconds,query_duration_ms,read_rows,result_rows,memory_usage,query from system.query_log where query like '%opensky%' and query_duration_ms <> 0 and query not like '%event_time_microseconds%' order by event_time_microseconds desc limit 5;
-- query disk space
SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'opensky';
-- full table count
select count(latitude_1) from opensky;
-- full table max/min
select max(longitude_1),min(altitude_1) from opensky;
-- full table average
select avg(latitude_2) from opensky;
-- full table variance
select var_pop(longitude_2) from opensky;
-- Complex query 1: Aggregate query of multiple fields in the whole table
select count(latitude_1),max(longitude_1),min(altitude_1),avg(latitude_2) from opensky;
-- Complex query 2: Number of flights departing from the three main Moscow airports
SELECT origin, count() AS c FROM opensky WHERE origin IN ('UUEE', 'UUDD', 'UUWW') GROUP BY origin;

5.為什麼Clickhouse這麼快

5.1 列式儲存

資料儲存在列中，資料就是索引；
查詢只存取涉及的列，減少了系統 I/O；
每列由一個執行緒處理，有效利用 CPU 資源；
它還為向量化執行奠定了基礎。

5.2 資料壓縮

資料壓縮的本質是按照一定的步長對資料進行匹配掃描，發現重複資料時進行編碼轉換。

因為是列式儲存，資料特性非常相似，所以資料中存在很多重複，壓縮率越高，資料量越小，磁碟I/O壓力越低，網路傳輸越快。

5.3 向量化執行引擎

單指令多資料是指一條指令操作多條資料。是通過資料並行來提高效能的一種方式，可以簡單理解為程式中資料在暫存器級別的並行處理。

Clickhouse 廣泛使用 SIMD 來提高計算效率。通過使用SIMD，基本上可以帶來數倍的效能提升。

5.4 多執行緒和分散式

分散式領域有個規律，計算移動比資料移動更划算，這就是它的核心。

資料的計算直接傳送到資料所在的伺服器，進行多機並行處理，然後將最終結果彙總在一起。

此外，ClickHouse 還通過執行緒級並行進一步提高效率，充分利用伺服器資源。

5.5 各種表引擎

MergeTree 儲存結構對寫入的資料進行排序，然後有序儲存。有序儲存有兩個主要優點：

對列存檔案進行分塊壓縮時，排序鍵中的列值是連續的或重複的，這樣列存分塊中的資料才能得到最終的壓縮比。
儲存順序本身可以加快查詢的索引結構。根據排序鍵中列的等價條件或範圍條件，我們可以快速找到目標的大致位置範圍，而且這種索引結構不會產生額外的儲存開銷。

MergeTree 是 ClickHouse 表引擎中的核心引擎。其他引擎基於 MergeTree 引擎，在資料合併過程中實現不同的特性，從而形成 MergeTree 表引擎家族。

6.總結

Clickhouse的優缺點如下：

優勢：極致的查詢分析效能、低儲存成本、高吞吐資料寫入、多樣化的表引擎、完備的DBMS功能。
缺點：不支援事務，不支援真正的刪除/更新，分發能力弱；不支援高並行，官方推薦100 QPS。

對於非標準的SQL，join的實現比較特殊，效能不好；頻繁的小批次資料操作會影響查詢效能。目前還沒有可以滿足各種場景需求的OLAP引擎。本質原因是沒有一個系統可以同時在查詢效率、及時性和可維護性方面做到完美。只能說ClickHouse是為了極致的查詢效能。做了一些取捨。ClickHouse 的優缺點是顯而易見的。是否採用取決於與實際業務場景的契合度。適合你的架構是最好的架構。

7.結束語

這篇部落格就和大家分享到這裡，如果大家在研究學習的過程當中有什麼問題，可以加群進行討論或傳送郵件給我，我會盡我所能為您解答，與君共勉！

另外，博主出書了《Kafka並不難學》和《Hadoop巨量資料挖掘從入門到進階實戰》，喜歡的朋友或同學，可以在公告欄那裡點選購買連結購買博主的書進行學習，在此感謝大家的支援。關注下面公眾號，根據提示，可免費獲取書籍的教學視訊。

Clickhouse基準測試實踐

1.概述

2.內容

2.1 MySQL

2.2 InfluxDB

2.3 Clickhouse

3.測試資料準備

3.1 下載資料

3.2 建立表Clickhouse

3.2.1 匯入資料

3.3 建立MySQL

3.4 建立InfluxDB

4.測試場景

4.1 MySQL測試維度

4.2 InfluxDB測試維度

4.3 Clickhouse測試維度

5.為什麼Clickhouse這麼快

5.1 列式儲存

5.2 資料壓縮

5.3 向量化執行引擎

5.4 多執行緒和分散式

5.5 各種表引擎

6.總結

7.結束語

熱愛生活，享受程式設計，與君共勉！

公眾號：

作者：哥不是小蘿莉［關於我］［犒賞］

出處：http://www.cnblogs.com/smartloli/

轉載請註明出處，謝謝合作！