This article is from Cnblogs, author: T-BARBARIANS. When reposting, please credit the original link: https://www.cnblogs.com/t-bar/p/16506289.html Thank you!
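Preface
The previous post explored LZ4's compression and decompression performance and compared LZ4 with ZSTD side by side. It ended with a teaser question: can general-purpose algorithms such as ZSTD and LZ4 compress a string of any length well?
This post looks for the answer.
I. Short-string compression with general-purpose algorithms
Straight to the point: we take a fairly short piece of text: Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.
We compress this text with ZSTD and with LZ4. Their results are shown below.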
ZSTD: [screenshot of the compression result]
LZ4: [screenshot of the compression result]
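On this short text, zstd achieves a very low compression ratio, and the lz4 output is actually longer than the original string. Why? Honestly, I did not expect this either.
To borrow two famous lines from two big names:
Are you ok?
What's your problem?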
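II. Short-string compression
The results above show that every compression algorithm has its own use case, and that not every string length suits a given algorithm. The usual reason is that a general-purpose compressor stores, alongside the compressed data, the metadata needed to restore the original string, and for a short input that metadata can be larger than the savings, or even larger than the string itself.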
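To put a number on that overhead, here is a minimal sketch of the size arithmetic for LZ4. The first function mirrors the worst-case bound formula that lz4.h exposes as the LZ4_COMPRESSBOUND macro. The second counts the bytes of a block that holds only literals (one token byte, optional length-extension bytes, then the raw bytes), which is what the block format produces when no matches are found. Both function names are hypothetical helpers written for illustration, not part of the LZ4 API, and the sketch ignores any header that the frame format adds on top.

```cpp
#include <cstddef>

// Worst-case compressed size, same formula as LZ4_COMPRESSBOUND in lz4.h.
// (Hypothetical helper name; not an LZ4 function.)
std::size_t lz4_worst_case_bound(std::size_t n) {
    return n + n / 255 + 16;
}

// Size of an LZ4 block that stores n bytes as literals only:
// 1 token byte, then extra length bytes if n >= 15, then the n raw bytes.
// (Hypothetical helper name; assumes a single literal-only sequence.)
std::size_t lz4_literal_only_size(std::size_t n) {
    std::size_t extra = 0;
    if (n >= 15) {
        // Each extension byte adds up to 255; a final byte below 255 ends the run.
        extra = (n - 15) / 255 + 1;
    }
    return 1 + extra + n;
}
```

For the 107-byte sample text, lz4_worst_case_bound(107) gives 123 and lz4_literal_only_size(107) gives 109, so a match-free encoding is already 2 bytes longer than the input before any frame header is counted.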
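So here is the question: "I really do need to compress short strings, and I want the best compression I can get. What do I do?"
Enter the short-string compression algorithms. I picked three of the better ones: smaz, shoco, and, as the finale, unishox2. As in the previous two posts, let's look at how each performs on short strings in terms of compression ratio, compression speed, and decompression speed.
(1) Smaz
1. Compressing and decompressing with Smaz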
#include <stdio.h>
#include <string.h>
#include <iostream>
#include "smaz.h"

using namespace std;

int main()
{
    int buf_len;
    int com_size;
    int decom_size;

    char com_buf[4096] = {0};
    char decom_buf[4096] = {0};

    char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";

    // Compress the sample text into com_buf.
    buf_len = strlen(str_buf);
    com_size = smaz_compress(str_buf, buf_len, com_buf, sizeof(com_buf));

    cout << "text size:" << buf_len << endl;
    cout << "compress text size:" << com_size << endl;
    cout << "compress ratio:" << (float)buf_len / (float)com_size << endl << endl;

    // Decompress and report the restored length.
    decom_size = smaz_decompress(com_buf, com_size, decom_buf, sizeof(decom_buf));
    cout << "decompress text size:" << decom_size << endl;

    // Round-trip check: both the length and the content must match the source.
    if(decom_size != buf_len || strncmp(str_buf, decom_buf, buf_len)) {
        cout << "decompress text is not equal to source text" << endl;
    }

    return 0;
}
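Running it shows the following result:
After smaz compression the short string is 77 bytes long, 30 bytes fewer than the 107-byte source string.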
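Why does this work on such a short input? As far as I understand it, smaz does not build any per-input dictionary; it relies on a fixed, built-in codebook of common text fragments and emits one byte per matched fragment, with an escape for bytes that match nothing. The toy below is a minimal sketch of that idea only. The ten-entry codebook, the escape value, and the function names are all hypothetical, and this is not smaz's real table or encoding.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical mini codebook; one output byte (the index) per matched entry.
static const std::vector<std::string> kCodebook = {
    " ", "the", "e", "t", "a", "ing", "in", " and ", "it", "s"
};
// Hypothetical escape marker: the byte that follows is copied verbatim.
static const unsigned char kVerbatim = 254;

std::string toy_compress(const std::string& in) {
    std::string out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        int best = -1;
        std::size_t best_len = 0;
        // Greedy search: pick the longest codebook entry matching at pos.
        for (std::size_t i = 0; i < kCodebook.size(); ++i) {
            const std::string& w = kCodebook[i];
            if (w.size() > best_len && in.compare(pos, w.size(), w) == 0) {
                best = static_cast<int>(i);
                best_len = w.size();
            }
        }
        if (best >= 0) {
            out.push_back(static_cast<char>(best));
            pos += best_len;
        } else {
            // No entry matches: spend two bytes on one input byte.
            out.push_back(static_cast<char>(kVerbatim));
            out.push_back(in[pos]);
            ++pos;
        }
    }
    return out;
}

std::string toy_decompress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c == kVerbatim) {
            if (i + 1 < in.size()) out.push_back(in[++i]);
        } else if (c < kCodebook.size()) {
            out += kCodebook[c];
        }
    }
    return out;
}
```

Text that resembles the codebook shrinks, while text that does not (for example "XYZ") doubles in size, which is the same trade-off a fixed-codebook compressor makes: no per-string metadata, but a ratio that depends on how well the input matches the table.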
2. Smaz compression and decompression performance
#include <stdio.h>
#include <stdlib.h>     // malloc, free
#include <string.h>
#include <iostream>
#include <sys/time.h>
#include "smaz.h"

using namespace std;

int main()
{
    int cnt = 0;
    int buf_len;
    int com_size;
    int decom_size;

    timeval st, et;

    char *com_ptr = NULL;
    char *decom_ptr = NULL;

    char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";

    buf_len = strlen(str_buf);

    // Compression benchmark: allocate, compress, and free in a loop for about 10 seconds.
    // The output buffer is only buf_len bytes; that is enough for this sample
    // (it compresses to 77 bytes) but not guaranteed for arbitrary input.
    gettimeofday(&st, NULL);
    while(1) {

        com_ptr = (char *)malloc(buf_len);
        com_size = smaz_compress(str_buf, buf_len, com_ptr, buf_len);

        free(com_ptr);
        cnt++;

        // Only whole seconds are compared, so the window is approximately 10 s.
        gettimeofday(&et, NULL);
        if(et.tv_sec - st.tv_sec >= 10) {
            break;
        }
    }

    cout << endl << "compress per second:" << cnt/10 << " times" << endl;

    // Compress once more to produce the input for the decompression benchmark.
    cnt = 0;
    com_ptr = (char *)malloc(buf_len);
    com_size = smaz_compress(str_buf, buf_len, com_ptr, buf_len);

    gettimeofday(&st, NULL);
    while(1) {

        // The decompressed text is never longer than the original buffer.
        decom_ptr = (char *)malloc(buf_len + 1);
        decom_size = smaz_decompress(com_ptr, com_size, decom_ptr, buf_len + 1);

        // Sanity check: the decompressed length must equal the source length.
        if(buf_len != decom_size) {
            cout << "decom error" << endl;
        }

        free(decom_ptr);
        cnt++;

        gettimeofday(&et, NULL);
        if(et.tv_sec - st.tv_sec >= 10) {
            break;
        }
    }

    cout << "decompress per second:" << cnt/10 << " times" << endl << endl;

    free(com_ptr);
    return 0;
}
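So how did it do?
Compression runs at roughly 400,000 strings per second, and decompression is in the millions per second. Not bad at all!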
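One caveat about these numbers: the loop above calls gettimeofday and malloc/free on every iteration and compares whole seconds, so part of what it measures is the harness itself. A possible alternative, sketched below under my own assumptions, uses std::chrono::steady_clock and only reads the clock once per batch of calls. The function name ops_per_second and its parameters are hypothetical; the operation being timed (for example a lambda that calls smaz_compress into a preallocated buffer) is supplied by the caller.

```cpp
#include <chrono>
#include <cstdint>

// Runs op() in batches until at least `seconds` have elapsed on a monotonic
// clock, then returns the average number of calls per second.
// (Hypothetical helper; batch size and duration are caller-chosen.)
template <typename Op>
double ops_per_second(Op op, double seconds = 1.0, int batch = 1000) {
    using clock = std::chrono::steady_clock;
    std::uint64_t count = 0;
    double elapsed = 0.0;
    const auto start = clock::now();
    do {
        // Read the clock once per batch so timer overhead stays small.
        for (int i = 0; i < batch; ++i) {
            op();
        }
        count += static_cast<std::uint64_t>(batch);
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < seconds);
    return static_cast<double>(count) / elapsed;
}
```

Dividing by the measured elapsed time, rather than by a fixed 10, keeps the rate correct even when the loop overshoots the target duration.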
(2) Shoco
1. Compressing and decompressing with Shoco
#include <stdio.h>
#include <string.h>
#include <iostream>
#include "shoco.h"

using namespace std;

int main()
{
    int buf_len;
    int com_size;
    int decom_size;

    char com_buf[4096] = {0};
    char decom_buf[4096] = {0};

    char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";

    // Compress the sample text into com_buf (shoco returns size_t).
    buf_len = strlen(str_buf);
    com_size = (int)shoco_compress(str_buf, buf_len, com_buf, sizeof(com_buf));

    cout << "text size:" << buf_len << endl;
    cout << "compress text size:" << com_size << endl;
    cout << "compress ratio:" << (float)buf_len / (float)com_size << endl << endl;

    // Decompress and report the restored length.
    decom_size = (int)shoco_decompress(com_buf, com_size, decom_buf, sizeof(decom_buf));
    cout << "decompress text size:" << decom_size << endl;

    // Round-trip check: both the length and the content must match the source.
    if(decom_size != buf_len || strncmp(str_buf, decom_buf, buf_len)) {
        cout << "decompress text is not equal to source text" << endl;
    }

    return 0;
}
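Running it shows the following result:
After shoco compression the short string is 86 bytes long, 21 bytes fewer than the source string. Its compression ratio is lower than smaz's.
2. Shoco compression and decompression performance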
#include <stdio.h>
#include <stdlib.h>     // malloc, free
#include <string.h>
#include <iostream>
#include <sys/time.h>
#include "shoco.h"

using namespace std;

int main()
{
    int cnt = 0;
    int buf_len;
    int com_size;
    int decom_size;

    timeval st, et;

    char *com_ptr = NULL;
    char *decom_ptr = NULL;

    char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";

    buf_len = strlen(str_buf);

    // Compression benchmark: allocate, compress, and free in a loop for about 10 seconds.
    // The output buffer is only buf_len bytes; that is enough for this sample
    // (it compresses to 86 bytes) but not guaranteed for arbitrary input.
    gettimeofday(&st, NULL);
    while(1) {

        com_ptr = (char *)malloc(buf_len);
        com_size = (int)shoco_compress(str_buf, buf_len, com_ptr, buf_len);

        free(com_ptr);
        cnt++;

        // Only whole seconds are compared, so the window is approximately 10 s.
        gettimeofday(&et, NULL);
        if(et.tv_sec - st.tv_sec >= 10) {
            break;
        }
    }

    cout << endl << "compress per second:" << cnt/10 << " times" << endl;

    // Compress once more to produce the input for the decompression benchmark.
    cnt = 0;
    com_ptr = (char *)malloc(buf_len);
    com_size = (int)shoco_compress(str_buf, buf_len, com_ptr, buf_len);

    gettimeofday(&st, NULL);
    while(1) {

        // The decompressed text is never longer than the original buffer.
        decom_ptr = (char *)malloc(buf_len + 1);
        decom_size = (int)shoco_decompress(com_ptr, com_size, decom_ptr, buf_len + 1);

        // Sanity check: the decompressed length must equal the source length.
        if(buf_len != decom_size) {
            cout << "decom error" << endl;
        }

        free(decom_ptr);
        cnt++;

        gettimeofday(&et, NULL);
        if(et.tv_sec - st.tv_sec >= 10) {
            break;
        }
    }

    cout << "decompress per second:" << cnt/10 << " times" << endl << endl;

    free(com_ptr);
    return 0;
}
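And how did this one do?
Holy shit! Compression and decompression both reach an astonishing rate in the millions per second. It is much as the algorithm's authors say themselves, to paraphrase: "shoco does not want to compete with general-purpose compressors on long strings; our strength is fast compression of short strings, even if the compression ratio is poor!" Put that way, there is not much to argue with.
(3) Unishox2
Now let's take a look at unishox2.
1. Compressing and decompressing with Unishox2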
#include <stdio.h>
#include <string.h>
#include "unishox2.h"

int main()
{
    int buf_len;
    int com_size;
    int decom_size;

    char com_buf[4096] = {0};
    char decom_buf[4096] = {0};

    char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";

    // Compress the sample text; the "simple" API takes no output-size argument,
    // so the caller must supply a buffer that is large enough.
    buf_len = strlen(str_buf);
    com_size = unishox2_compress_simple(str_buf, buf_len, com_buf);

    printf("text size:%d\n", buf_len);
    printf("compress text size:%d\n", com_size);
    printf("compress ratio:%f\n\n", (float)buf_len / (float)com_size);

    // Decompress and report the restored length.
    decom_size = unishox2_decompress_simple(com_buf, com_size, decom_buf);

    printf("decompress text size:%d\n", decom_size);

    // Round-trip check: both the length and the content must match the source.
    if(decom_size != buf_len || strncmp(str_buf, decom_buf, buf_len)) {
        printf("decompress text is not equal to source text\n");
    }

    return 0;
}
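The result is as follows:
After Unishox2 compression the short string is 67 bytes long, 40 bytes fewer than the source string, which is roughly 60% of the original size. Nice.
2. Unishox2 compression and decompression performance
So far Unishox2 has the best compression of the three. If its compression and decompression speed is also good, it would be close to perfect. Let's look at how fast it is.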
#include <stdio.h>
#include <stdlib.h>     // malloc, free (replaces the non-standard <malloc.h>)
#include <string.h>
#include <sys/time.h>
#include "unishox2.h"

int main()
{
    int cnt = 0;
    int buf_len;
    int com_size;
    int decom_size;

    struct timeval st, et;

    char *com_ptr = NULL;
    char *decom_ptr = NULL;

    char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";

    buf_len = strlen(str_buf);

    // Compression benchmark: allocate, compress, and free in a loop for about 10 seconds.
    // The "simple" API has no output-size argument; a buf_len-byte buffer is enough
    // for this sample (it compresses to 67 bytes) but not guaranteed for arbitrary input.
    gettimeofday(&st, NULL);
    while(1) {

        com_ptr = (char *)malloc(buf_len);
        com_size = unishox2_compress_simple(str_buf, buf_len, com_ptr);

        free(com_ptr);
        cnt++;

        // Only whole seconds are compared, so the window is approximately 10 s.
        gettimeofday(&et, NULL);
        if(et.tv_sec - st.tv_sec >= 10) {
            break;
        }
    }

    printf("\ncompress per second:%d times\n", cnt/10);

    // Compress once more to produce the input for the decompression benchmark.
    cnt = 0;
    com_ptr = (char *)malloc(buf_len);
    com_size = unishox2_compress_simple(str_buf, buf_len, com_ptr);

    gettimeofday(&st, NULL);
    while(1) {

        // The decompressed text is never longer than the original buffer.
        decom_ptr = (char *)malloc(buf_len + 1);
        decom_size = unishox2_decompress_simple(com_ptr, com_size, decom_ptr);

        // Sanity check: the decompressed length must equal the source length.
        if(buf_len != decom_size) {
            printf("decom error\n");
        }

        free(decom_ptr);
        cnt++;

        gettimeofday(&et, NULL);
        if(et.tv_sec - st.tv_sec >= 10) {
            break;
        }
    }

    printf("decompress per second:%d times\n\n", cnt/10);

    free(com_ptr);
    return 0;
}
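The result is as follows:
Things did not go as hoped. Unishox2 has the best compression ratio of the three algorithms, but it also has the worst compression and decompression performance.
III. Summary
This post covered three short-string compression algorithms, smaz, shoco, and unishox2, and measured the compression ratio and the compression and decompression performance of each. The results are shown in the table below.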
Table 1 [table image not preserved]
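shoco has the lowest compression ratio but the highest compression and decompression throughput; smaz sits in the middle; unishox2 has the highest compression ratio but the lowest compression and decompression performance.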
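The ratio side of this comparison follows directly from the sizes reported earlier for the 107-byte sample (77 bytes for smaz, 86 for shoco, 67 for unishox2). A small sketch of that arithmetic, with hypothetical helper names and the ratio defined as in the listings above (original size divided by compressed size):

```cpp
// Compression ratio as used in this post: original size / compressed size.
// (Hypothetical helper; higher means better compression.)
double compression_ratio(int original_size, int compressed_size) {
    return static_cast<double>(original_size) / compressed_size;
}

// Fraction of the original size that was saved, in percent.
// (Hypothetical helper.)
double space_saving_percent(int original_size, int compressed_size) {
    return 100.0 * (original_size - compressed_size) / original_size;
}
```

With these definitions, smaz comes out at a ratio of about 1.39 (28.0% saved), shoco at about 1.24 (19.6% saved), and unishox2 at about 1.60 (37.4% saved).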
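This conclusion matches the analysis of long-string compression in the previous two posts: a higher compression ratio costs compression performance, and you cannot have both.
In practice the choice depends on your own requirements and environment. If moderate compression is enough, shoco is a good pick because it is fast; if you want to save more space, choose smaz or unishox2.
That wraps up this series on string compression. If it was of some help to you, fellow tech enthusiasts, please log in and give it a like. Thank you!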