最近在資料處理中用到了窗函數, 把使用方法記錄一下, 暫時只有分組排序和滑動時間視窗的例子, 以後再逐步新增
在SQL查詢時, 會遇到有兩類需要分組統計的場景, 在之前的SQL語法中是不方便實現的
使用窗函數直接SQL中使用窗函數就能解決這些問題, 否則需要使用臨時表, 函數或儲存過程進行處理.
PostgreSQL 從2010年的版本8開始就支援窗函數了.
詳細說明建議檢視官方檔案 https://www.postgresql.org/docs/current/tutorial-window.html
窗函數(window function)的計算方式與傳統的單行和聚合不同
所以綜合的說, 視窗函數就是在行的基礎上, 允許對多行資料進行計算. 下面是一個簡單的窗函數例子, 將每個員工的薪資與其所在的部門的平均薪資進行比較
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
使用窗函數時會用到的一些關鍵詞
功能: 將資料按指定的欄位分組, 再按另一個欄位排列, 給每個分組裡的資料打上序號.
這是一個常用技巧, 例如要計算各組內記錄之間的時間間隔, 但是用時間不方便join, 打完序號後就可以用序號join了
SELECT
ROW_NUMBER() OVER w1 AS rn,
sample_01.*
FROM
sample_01
WINDOW
w1 AS (PARTITION BY field_name ORDER BY created_at ASC);
功能: 將資料表按指定欄位(日期型別)進行排序, 然後基於每個記錄的這個欄位建立一個固定寬度的時間視窗, 對視窗內的多個記錄進行統計
統計單個欄位, 可以直接寫在select中
SELECT
MAX(amount) OVER (ORDER BY traded_at RANGE '30 minutes' PRECEDING) AS amount_max,
*
FROM sample_01
WHERE card_num = '6210812500006111111'
功能: 和前一個功能一樣, 但是要進行多個不同的統計, 要重複用到這個視窗函數
如果要統計多個欄位, 可以抽出單獨的WINDOW
SELECT
MAX(rn) OVER w1 AS rn_max,
MAX(amount) OVER w1 AS amount_max,
AVG(amount) OVER w1 AS amount_avg,
*
FROM sample_01_diff
WINDOW
-- w1 AS (ORDER BY traded_at RANGE '30 minutes' PRECEDING)
w1 AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING)
ORDER BY
rn ASC
在這個例子中
In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column. For numeric ordering columns it is typically of the same type as the ordering column, but for datetime ordering columns it is an interval. For example, if the ordering column is of type date or timestamp, one could write
RANGE BETWEEN '1 day' PRECEDING AND '10 days'
FOLLOWING. The offset is still required to be non-null and non-negative, though the meaning of 「non-negative」 depends on its data type.
功能: 在前面的功能基礎上, 同時存在多個時間視窗
SELECT
-- 1 hour
SUM(amount_in) OVER w1h AS h1_amount_in_sum,
SUM(
CASE
WHEN amount_in = 0 THEN 0
ELSE 1
END
) OVER w1h AS h1_amount_in_count,
SUM(amount_out) OVER w1h AS h1_amount_out_sum,
SUM(
CASE
WHEN amount_out = 0 THEN 0
ELSE 1
END
) OVER w1h AS h1_amount_out_count,
SUM(amount) OVER w1h AS h1_amount_sum,
COUNT(amount) OVER w1h AS h1_amount_count,
ROUND(AVG(amount) OVER w1h, 2) AS h1_amount_avg,
FIRST_VALUE(amount) OVER w1h AS h1_amount_first,
LAST_VALUE(amount) OVER w1h AS h1_amount_last,
MAX(amount) OVER w1h AS h1_amount_max,
MIN(amount) OVER w1h AS h1_amount_min,
-- 3 hour
SUM(amount_in) OVER w3h AS h3_amount_in_sum,
SUM(
CASE
WHEN amount_in = 0 THEN 0
ELSE 1
END
) OVER w3h AS h3_amount_in_count,
SUM(amount_out) OVER w3h AS h3_amount_out_sum,
SUM(
CASE
WHEN amount_out = 0 THEN 0
ELSE 1
END
) OVER w3h AS h3_amount_out_count,
SUM(amount) OVER w3h AS h3_amount_sum,
COUNT(amount) OVER w3h AS h3_amount_count,
ROUND(AVG(amount) OVER w3h, 2) AS h3_amount_avg,
FIRST_VALUE(amount) OVER w3h AS h3_amount_first,
LAST_VALUE(amount) OVER w3h AS h3_amount_last,
MAX(amount) OVER w3h AS h3_amount_max,
MIN(amount) OVER w3h AS h3_amount_min,
*
FROM sample_01
WINDOW
w1h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING),
w3h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '90 minutes' PRECEDING AND '90 minutes' FOLLOWING)
;