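Traffic Hijacking: Zero-Overhead JS Injection into GZIP Pages

Preface

Injecting JS into pages is a common requirement for an HTTP proxy. The page returned by the upstream server may be compressed, so it has to be decompressed before injection, and to save bandwidth it has to be compressed again before being sent downstream. Decompressing and recompressing the entire page just to inject a small snippet wastes a great deal of CPU time.

Is there a more efficient solution? This article explores the question from three angles: injection position, compression format, and checksum algorithm.

Injection position

A common way to inject is to replace an HTML tag, for example replacing <head> with <head><script>....

Plain string matching is simple but not rigorous. If <head> never appears in the page, nothing is injected. Handling case differences and tags with attributes requires regular-expression matching. In more extreme cases, for example when the first match falls inside a comment, the injected code never runs: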

<html>
  <!-- <head></head> -->
  <head></head>
  <body></body>
</html>
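
The regex approach and the comment pitfall can be sketched as follows. This is a minimal illustration, not the article's implementation; the helper name injectAfterHead and the exact pattern are assumptions made for this example.

```javascript
// Hypothetical helper: insert `snippet` right after the first <head ...> tag.
function injectAfterHead(html, snippet) {
  // Case-insensitive and tolerant of attributes, e.g. <HEAD lang="en">.
  // The \b keeps tags such as <header> from matching.
  const headTag = /<head\b[^>]*>/i
  // A function replacement avoids special handling of "$" inside the snippet.
  return html.replace(headTag, tag => tag + snippet)
}

const snippet = '<script src="inject.js"></script>'

console.log(injectAfterHead('<HTML><HEAD lang="en"></HEAD></HTML>', snippet))

// The pitfall: the first match sits inside a comment, so the script never runs.
console.log(injectAfterHead('<!-- <head></head> --><head></head>', snippet))
```

The second call shows why a pattern alone cannot be fully reliable: it has no notion of comments, and teaching it one means moving toward a real HTML parser.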

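As for parsing HTML on the gateway, that is a heavyweight operation and is usually not considered.

In practice, regex matching is enough to cover most cases. Still, regex matching has some cost. Is there a lighter, or even zero-overhead, way to inject?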

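There is: inject the code at the very top of the page! This is not standards-compliant, but all mainstream browsers support it. If you are worried that the page's doctype will no longer take effect, include one in the injected code: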

<!doctype html><script src="inject.js"></script>
<!doctype html>
<html>
  <head></head>
  <body></body>
</html>

This way the gateway performs no replacement at all. When forwarding, it only has to prepend the injected code to the first chunk.
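
For the plaintext case, the forwarding logic might look like the sketch below. It uses a synchronous generator over an array of chunks to stay short; a real proxy would work on streams, and the name prependInjection is hypothetical.

```javascript
// Hypothetical sketch: emit the injected bytes once, then pass every
// upstream chunk through untouched. Nothing is searched or replaced.
function* prependInjection(injectBuf, upstreamChunks) {
  yield injectBuf
  yield* upstreamChunks
}

const inject = Buffer.from('<!doctype html><script src="inject.js"></script>\n')
const upstream = [
  Buffer.from('<!doctype html>\n<html>\n  <head></head>\n'),
  Buffer.from('  <body></body>\n</html>\n'),
]

const page = Buffer.concat([...prependInjection(inject, upstream)])
console.log(page.toString())
```

If the upstream response carries a Content-Length header, the proxy would also need to increase it by the size of the injected bytes or drop it.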

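That only covers plaintext transfer, though. If the upstream returns compressed traffic, does prepending the "compressed injected code" still work?

Let's continue the discussion using gzip as the example.

File format

gzip compresses data with the DEFLATE algorithm (the body part in the table below), prepends a 10-byte file header plus variable-length optional headers (recording the file name and so on), and appends an 8-byte file trailer: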

struct	field	length
header	magic number (1f 8b)	2
	compression method (08)	1
	flags	1
	timestamp	4
	compression flags	1
	operating system ID	1
extra headers (optional)	...	...
body	block1	...
	block2	...
	...	...
trailer	CRC32	4
trailer	uncompressed data length	4

https://en.wikipedia.org/wiki/Gzip

Since our data comes first, we must supply the file header ourselves and strip the header returned by the upstream.
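
Stripping the upstream header means knowing where it ends, and the optional fields make that variable. The sketch below computes the header length from the flag bits defined in RFC 1952. The helper names are hypothetical, and the code assumes the whole header is already in one buffer, which a streaming proxy would have to ensure itself.

```javascript
// Bits of the flags byte (offset 3), as defined in RFC 1952.
const FHCRC = 0x02, FEXTRA = 0x04, FNAME = 0x08, FCOMMENT = 0x10

// Return the position just past a zero-terminated string starting at `pos`.
function skipZeroTerminated(buf, pos) {
  const end = buf.indexOf(0, pos)
  if (end === -1) throw new Error('truncated gzip header')
  return end + 1
}

// Hypothetical helper: total header length, i.e. the offset of the first DEFLATE block.
function gzipHeaderLength(buf) {
  if (buf.length < 10 || buf[0] !== 0x1f || buf[1] !== 0x8b || buf[2] !== 0x08) {
    throw new Error('not a gzip stream')
  }
  const flags = buf[3]
  let pos = 10                                     // fixed part of the header
  if (flags & FEXTRA) {
    if (pos + 2 > buf.length) throw new Error('truncated gzip header')
    pos += 2 + (buf[pos] | (buf[pos + 1] << 8))    // 2-byte little-endian length + extra field
  }
  if (flags & FNAME) pos = skipZeroTerminated(buf, pos)      // original file name
  if (flags & FCOMMENT) pos = skipZeroTerminated(buf, pos)   // comment
  if (flags & FHCRC) pos += 2                                // header CRC16
  if (pos > buf.length) throw new Error('truncated gzip header')
  return pos
}

const plain = Buffer.from([0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 3])
console.log(gzipHeaderLength(plain))   // 10
```

Servers that compress on the fly usually send the bare 10-byte header, but a pre-compressed file produced by the gzip command line tool often carries the file name, so a fixed offset of 10 is not always safe.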

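In addition, the following questions need to be settled:

Does the CRC32 checksum in the trailer need to be updated?
Are the blocks in the compressed data independent of each other?

The answer to the first question can be guessed even without research: on the browser side it is almost certainly unnecessary. Web pages are processed as a stream and rendered as data arrives. If the browser only learned that the data was corrupt after rendering had finished, should it keep the page or refuse to show it? So far, at least, I have never seen a web page report a gzip checksum failure.

The second question is answered in RFC 1951: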

Each block is compressed using a combination of the LZ77 algorithm
and Huffman coding. The Huffman trees for each block are independent
of those for previous or subsequent blocks; the LZ77 algorithm may
use a reference to a duplicated string occurring in a previous block,
up to 32K input bytes before.

Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.

https://www.rfc-editor.org/rfc/rfc1951

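A block may reference data from earlier blocks. Fortunately, a reference is computed from the current position (<length, backward distance>), so it is a relative value and is not disturbed when we insert our own block at the start of the stream.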
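
A toy model makes the point about relative references concrete. The decoder below is not DEFLATE: it ignores Huffman coding and the bit format, and only reproduces the copy semantics of a <length, backward distance> pair. The token representation is invented for this illustration.

```javascript
// Toy model only: a token is either a literal string or a
// [length, distance] back-reference into the output produced so far.
function decodeTokens(tokens) {
  let out = ''
  for (const token of tokens) {
    if (typeof token === 'string') {
      out += token
    } else {
      const [length, distance] = token
      // Copy one character at a time so overlapping copies also work.
      for (let i = 0; i < length; i++) {
        out += out[out.length - distance]
      }
    }
  }
  return out
}

// "b>" is copied from 6 characters back to close the tag.
const pageTokens = ['<b>hi</', [2, 6]]
const injectTokens = ['<script></script>']

console.log(decodeTokens(pageTokens))
// Distances are counted backwards from the current position, so the
// page decodes identically when other output precedes it.
console.log(decodeTokens([...injectTokens, ...pageTokens]))
```

Because the upstream was compressed on its own, none of its references reach back past its own first byte, so whatever we place in front of it is never read by them.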

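Also note that each block header carries a BFINAL field marking whether it is the last block. That field must not be set in our block, otherwise the blocks after it will not be parsed.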
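
Two properties of the injected DEFLATE data can be sanity-checked cheaply. The sketch below assumes the gzip header has already been stripped and uses hypothetical helper names. It also relies on a detail of the format: the upstream body starts on a byte boundary, so the injected data has to end on one too. A sync or full flush provides this by appending an empty stored block, whose last four bytes are 00 00 ff ff.

```javascript
// The first block starts on a byte boundary, and BFINAL is the
// least significant bit of its first byte.
function firstBlockIsFinal(deflate) {
  return (deflate[0] & 1) === 1
}

// A flushed (but unfinished) stream ends with an empty stored block: 00 00 ff ff.
// This is a sanity check, not a full parse of every block.
function endsWithFlushMarker(deflate) {
  const n = deflate.length
  return n >= 4 &&
    deflate[n - 4] === 0x00 && deflate[n - 3] === 0x00 &&
    deflate[n - 2] === 0xff && deflate[n - 1] === 0xff
}

// Hand-built example: a non-final stored block holding "hi",
// followed by the empty stored block that a flush appends.
const flushed = Buffer.from([
  0x00, 0x02, 0x00, 0xfd, 0xff, 0x68, 0x69,   // BFINAL=0, stored, LEN=2, NLEN=~2, "hi"
  0x00, 0x00, 0x00, 0xff, 0xff,               // BFINAL=0, stored, LEN=0, NLEN=0xffff
])
// A finished empty stream: one final fixed-Huffman block containing only end-of-block.
const finished = Buffer.from([0x03, 0x00])

console.log(firstBlockIsFinal(flushed), endsWithFlushMarker(flushed))     // false true
console.log(firstBlockIsFinal(finished), endsWithFlushMarker(finished))   // true false
```

Only data of the first kind is suitable for placing in front of the upstream body.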

Experiment

Let's implement a first demo in Node.js:

import zlib from 'node:zlib'
import http from 'node:http'

// gzip data returned by the upstream (for demonstration only; not streamed)
const htmlGzipBuf = zlib.gzipSync('<h1>Hello World</h1>')

// gzip data for the injected code (flushed but never finished, so no block is marked as the final one)
let injectGzipBuf = Buffer.alloc(0)

const tmp = zlib.createGzip()
tmp.on('data', buf => {
  injectGzipBuf = Buffer.concat([injectGzipBuf, buf])
})
tmp.write('<!doctype html><script>console.log("Hi Jack")</script>')
tmp.flush()

http.createServer((req, res) => {
  res.setHeader('content-type', 'text/html')
  res.setHeader('content-encoding', 'gzip')
  // Output the injected code, already compressed
  res.write(injectGzipBuf)
  // Skip the upstream gzip file header (10 bytes by default)
  res.end(htmlGzipBuf.subarray(10))
}).listen(8080)

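In this example both writes are compressed data, and the browser parses the result successfully.

In testing, all mainstream browsers handle it without problems, and so does curl. Some libraries do verify the CRC, however, such as fetch in Node.js: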

const res = await fetch('http://127.0.0.1:8080/')
const reader = res.body.getReader()
for (;;) {
  const {done, value} = await reader.read()
  if (done) {
    break
  }
  console.log(value)
}

Reading the last chunk raises an error:

Uncaught TypeError: terminated
    at Fetch.onAborted ...
  [cause]: Error: incorrect data check
      at Zlib.zlibOnError [as onerror] ...
    code: 'Z_DATA_ERROR'

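As a result, less data is read than expected.

Checksum algorithm

How can the checksum be updated? The crudest approach is to decompress all upstream traffic and recompute the CRC. Decompression costs far less than compression, after all, so this is still acceptable.

But this article aims for low or even zero overhead, so that solution is far from ideal. I remember from developing firewalls that when only a small part of a packet is modified, the checksum does not need to be recomputed; a small correction is enough. Can the same idea be applied to CRC? CRC is not a cryptographic hash algorithm, after all, just a few simple XOR operations, so there should be room for some tricks.

A look at the documentation shows that it is not only possible: the trick has even been included in the zlib library, which provides a crc32_combine function for merging two CRC32 values: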

crc32_combine(crc1, crc2, len2)

  Combine two CRC-32 check values into one.  For two sequences of bytes,
seq1 and seq2 with lengths len1 and len2, CRC-32 check values were
calculated for each, crc1 and crc2.  crc32_combine() returns the CRC-32
check value of seq1 and seq2 concatenated, requiring only crc1, crc2, and
len2.

For the underlying principles, see:

https://stackoverflow.com/questions/23122312/crc-calculation-of-a-mostly-static-data-stream/23126768

https://github.com/stbrumme/crc32/blob/master/Crc32.cpp

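With the trailer corrected this way, the output should be compatible with all HTTP clients, including those that verify the CRC.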
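
Node.js does not expose crc32_combine, so the sketch below is a plain JavaScript port of the classic matrix-based algorithm from zlib, together with a hypothetical patchTrailer helper that rewrites the 8-byte upstream trailer. It assumes the page is smaller than 4 GiB, because the length stored in the trailer is only kept modulo 2^32.

```javascript
// Standard CRC-32 lookup table (reflected polynomial 0xedb88320).
const CRC_TABLE = new Uint32Array(256)
for (let n = 0; n < 256; n++) {
  let c = n
  for (let k = 0; k < 8; k++) c = (c & 1) ? (0xedb88320 ^ (c >>> 1)) : (c >>> 1)
  CRC_TABLE[n] = c
}

function crc32(bytes) {
  let crc = 0xffffffff
  for (const b of bytes) crc = CRC_TABLE[(crc ^ b) & 0xff] ^ (crc >>> 8)
  return (crc ^ 0xffffffff) >>> 0
}

// Multiply a 32x32 bit matrix over GF(2) by a 32-bit vector.
function gf2MatrixTimes(mat, vec) {
  let sum = 0
  for (let i = 0; vec !== 0; i++, vec >>>= 1) {
    if (vec & 1) sum ^= mat[i]
  }
  return sum >>> 0
}

function gf2MatrixSquare(square, mat) {
  for (let n = 0; n < 32; n++) square[n] = gf2MatrixTimes(mat, mat[n])
}

// CRC of seq1 + seq2, given only crc1, crc2 and the length of seq2.
function crc32Combine(crc1, crc2, len2) {
  if (len2 <= 0) return crc1 >>> 0
  const even = new Uint32Array(32)
  const odd = new Uint32Array(32)
  odd[0] = 0xedb88320                      // operator for appending one zero bit
  for (let n = 1, row = 1; n < 32; n++, row <<= 1) odd[n] = row
  gf2MatrixSquare(even, odd)               // two zero bits
  gf2MatrixSquare(odd, even)               // four zero bits
  crc1 >>>= 0
  do {
    gf2MatrixSquare(even, odd)             // first pass: one zero byte
    if (len2 % 2 === 1) crc1 = gf2MatrixTimes(even, crc1)
    len2 = Math.floor(len2 / 2)
    if (len2 === 0) break
    gf2MatrixSquare(odd, even)
    if (len2 % 2 === 1) crc1 = gf2MatrixTimes(odd, crc1)
    len2 = Math.floor(len2 / 2)
  } while (len2 !== 0)
  return (crc1 ^ crc2) >>> 0
}

// Hypothetical helper: update the upstream trailer (CRC32 + uncompressed length) in place.
function patchTrailer(trailer, injectCrc, injectLen) {
  const upstreamCrc = trailer.readUInt32LE(0)
  const upstreamLen = trailer.readUInt32LE(4)   // stored modulo 2^32
  trailer.writeUInt32LE(crc32Combine(injectCrc, upstreamCrc, upstreamLen), 0)
  trailer.writeUInt32LE((injectLen + upstreamLen) % 2 ** 32, 4)
  return trailer
}
```

The CRC and length of the injected code can be computed once at startup, so the per-response cost is limited to holding back the last 8 bytes of the upstream body and a few hundred XOR operations.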

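Full demo

For simplicity, the earlier demo ignored gzip's optional header fields and used a Buffer instead of a data stream. Here is a more complete demo:

https://github.com/EtherDream/gzip-js-injector

Afterword

I wrote this article a few years ago while researching traffic hijacking but never published it; recently I refreshed it and added a demo. Brotli compression did not exist back then, so I did not look into it. I will add that when I have time.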