PostgreSQL 的視窗函數 OVER, WINDOW, PARTITION BY, RANGE

2022-05-27 12:00:50

最近在資料處理中用到了窗函數, 把使用方法記錄一下, 暫時只有分組排序和滑動時間視窗的例子, 以後再逐步新增

場景

在SQL查詢時, 會遇到有兩類需要分組統計的場景, 在之前的SQL語法中是不方便實現的

  1. 場景1: 顧客維修裝置的記錄表, 每次維修產生一條記錄, 每個記錄包含時間, 顧客ID和維修金額, 要取出每個顧客的維修次數和最後一次維修時的金額
  2. 場景2: 還是上面的維修記錄表, 要取出每個顧客的每次維修之間的時間間隔
  3. 場景3: 一個使用者賬戶的交易流水錶, 要求每個小時的交易筆數和平均收支金額, 這個平均數的統計範圍是兩個小時(整點時間的前後一個小時)

使用窗函數直接SQL中使用窗函數就能解決這些問題, 否則需要使用臨時表, 函數或儲存過程進行處理.

窗函數

PostgreSQL 從2010年的版本8開始就支援窗函數了.

檔案

詳細說明建議檢視官方檔案 https://www.postgresql.org/docs/current/tutorial-window.html

函數說明

窗函數(window function)的計算方式與傳統的單行和聚合不同

  1. 窗函數是在當前表中, 基於當前行的相關行的計算, 注意是基於多行的計算
  2. 屬於一種聚合計算, 可以使用聚合型別的函數(aggregate function)
  3. 使用窗函數並不會導致結果的聚合, 也就是結果依然是當前的行結構

所以綜合的說, 視窗函數就是在行的基礎上, 允許對多行資料進行計算. 下面是一個簡單的窗函數例子, 將每個員工的薪資與其所在的部門的平均薪資進行比較

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;

關鍵詞

使用窗函數時會用到的一些關鍵詞

  • OVER 前面的查詢基於後面的視窗
  • PARTITION BY 類似於 GROUP BY 的語意, 專用於視窗的分組
  • ORDER BY 窗內的排序依據, 依據的欄位決定了 RANGE 的型別
  • RANGE ... PRECEDING 在當前值之前的範圍, 基準是當前記錄這個 ORDER BY 欄位的值
  • RANGE ... FOLLOWING 在當前值之後的範圍, 基準是當前記錄這個 ORDER BY 欄位的值
  • RANGE BETWEEN ... PRECEDING AND ... FOLLOWING 前後範圍的組合
  • WINDOW 將視窗命名為變數, 可以在 SELECT 中重複使用

範例

按視窗打序號

功能: 將資料按指定的欄位分組, 再按另一個欄位排列, 給每個分組裡的資料打上序號.

這是一個常用技巧, 例如要計算各組內記錄之間的時間間隔, 但是用時間不方便join, 打完序號後就可以用序號join了

SELECT
    ROW_NUMBER() OVER w1 AS rn,
    sample_01.*
FROM
    sample_01
WINDOW 
w1 AS (PARTITION BY field_name ORDER BY created_at ASC);

簡單時間視窗統計

功能: 將資料表按指定欄位(日期型別)進行排序, 然後基於每個記錄的這個欄位建立一個固定寬度的時間視窗, 對視窗內的多個記錄進行統計

統計單個欄位, 可以直接寫在select中

SELECT
    MAX(amount) OVER (ORDER BY traded_at RANGE '30 minutes' PRECEDING) AS amount_max,
    *
FROM sample_01
WHERE card_num = '6210812500006111111'

基於時間視窗變數進行多欄位統計

功能: 和前一個功能一樣, 但是要進行多個不同的統計, 要重複用到這個視窗函數

如果要統計多個欄位, 可以抽出單獨的WINDOW

SELECT
    MAX(rn) OVER w1 AS rn_max,
    MAX(amount) OVER w1 AS amount_max,
    AVG(amount) OVER w1 AS amount_avg,
    *
FROM sample_01_diff
WINDOW
    -- w1 AS (ORDER BY traded_at RANGE '30 minutes' PRECEDING)
    w1 AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING)
ORDER BY
    rn ASC

在這個例子中

  1. 先依據 card_num 這個欄位進行分割區,
  2. 然後按 traded_at 這個欄位進行排序,
  3. 對每個記錄的 traded_at 值, 開啟一個 RANGE, 包含前面的30分鐘和後面的30分鐘, RANGE 中能用的型別和 ORDER BY 的欄位型別是相關的
  4. SELECT中的 MAX, MIN 等聚合函數, 是基於上面的 RANGE 進行的

In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column. For numeric ordering columns it is typically of the same type as the ordering column, but for datetime ordering columns it is an interval. For example, if the ordering column is of type date or timestamp, one could write RANGE BETWEEN '1 day' PRECEDING AND '10 days' FOLLOWING. The offset is still required to be non-null and non-negative, though the meaning of 「non-negative」 depends on its data type.

多個視窗多個欄位同時統計

功能: 在前面的功能基礎上, 同時存在多個時間視窗

SELECT
    -- 1 hour
    SUM(amount_in) OVER w1h AS h1_amount_in_sum,
    SUM(
        CASE
            WHEN amount_in = 0 THEN 0
            ELSE 1
        END
    ) OVER w1h AS h1_amount_in_count,
    SUM(amount_out) OVER w1h AS h1_amount_out_sum,
    SUM(
        CASE
            WHEN amount_out = 0 THEN 0
            ELSE 1
        END
    ) OVER w1h AS h1_amount_out_count,
    SUM(amount) OVER w1h AS h1_amount_sum,
    COUNT(amount) OVER w1h AS h1_amount_count,
    ROUND(AVG(amount) OVER w1h, 2) AS h1_amount_avg,
    FIRST_VALUE(amount) OVER w1h AS h1_amount_first,
    LAST_VALUE(amount) OVER w1h AS h1_amount_last,
    MAX(amount) OVER w1h AS h1_amount_max,
    MIN(amount) OVER w1h AS h1_amount_min,
    -- 3 hour
    SUM(amount_in) OVER w3h AS h3_amount_in_sum,
    SUM(
        CASE
            WHEN amount_in = 0 THEN 0
            ELSE 1
        END
    ) OVER w3h AS h3_amount_in_count,
    SUM(amount_out) OVER w3h AS h3_amount_out_sum,
    SUM(
        CASE
            WHEN amount_out = 0 THEN 0
            ELSE 1
        END
    ) OVER w3h AS h3_amount_out_count,
    SUM(amount) OVER w3h AS h3_amount_sum,
    COUNT(amount) OVER w3h AS h3_amount_count,
    ROUND(AVG(amount) OVER w3h, 2) AS h3_amount_avg,
    FIRST_VALUE(amount) OVER w3h AS h3_amount_first,
    LAST_VALUE(amount) OVER w3h AS h3_amount_last,
    MAX(amount) OVER w3h AS h3_amount_max,
    MIN(amount) OVER w3h AS h3_amount_min,
    *
FROM sample_01
WINDOW
    w1h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING),
    w3h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '90 minutes' PRECEDING AND '90 minutes' FOLLOWING)
;

參考