offline RL | Pessimistic Bootstrapping (PBRL): penalize uncertainty in the Q update to pull down OOD Q values

Paper title: Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning, ICLR 2022, review scores 6 6 8 8, spotlight.
PDF version: https://arxiv.org/abs/2202.11566
HTML version: https://ar5iv.labs.arxiv.org/html/2202.11566
OpenReview: https://openreview.net/forum?id=Y4cs1Z3HnqL
GitHub: https://github.com/Baichenjia/PBRL

0 abstract

Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. Previous methods tackle such problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyond the offline data and also lack precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms.

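background:
- Offline RL learns a policy from a previously collected dataset, without exploring the environment. Applying off-policy RL directly to offline RL usually fails because of the extrapolation error caused by OOD actions.
- Previous work tackles this problem by penalizing the Q values of OOD actions, or by constraining the trained policy to stay close to the behavior policy.
- However, these methods usually prevent the value function from generalizing beyond the offline dataset, and they also lack a precise characterization of OOD data.
method:
- We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm with no explicit policy constraint.
- Specifically, PBRL quantifies uncertainty via the disagreement among bootstrapped Q functions, and performs pessimistic updates by penalizing the value function according to the estimated uncertainty.
- To handle the extrapolation error, we further propose a new OOD sampling method.
- Theory: the OOD sampling + pessimistic bootstrapping above provably yields an uncertainty quantifier in linear MDPs.
experiments:
- Extensive experiments on the D4RL benchmark show that PBRL performs better than state-of-the-art algorithms.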

3 method

3.1 Uncertainty quantification with bootstrapped Q-functions

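Maintain K Q-functions, each updated independently via bootstrapping.
uncertainty \(U(s,a)=\mathrm{std}(Q^k(s,a))=\sqrt{\frac1K\sum_{k=1}^K\big(Q^k(s,a)-\bar Q(s,a)\big)^2}\), where \(\bar Q\) is the ensemble mean. (Looking at figure 1(a), the definition seems reasonable.)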
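As a rough illustration of the definition above, the sketch below computes the uncertainty as the population standard deviation (the 1/K form) of K ensemble predictions for a single (s, a). The function name `ensemble_uncertainty` and the sample numbers are made up for illustration; they are not taken from the paper or its code.

```python
import math

def ensemble_uncertainty(q_values):
    """Population std (1/K normalization) of K ensemble Q estimates for one (s, a)."""
    k = len(q_values)
    mean_q = sum(q_values) / k
    variance = sum((q - mean_q) ** 2 for q in q_values) / k
    return math.sqrt(variance)

# Hypothetical ensemble outputs: the members agree on an in-distribution pair
# and disagree on an out-of-distribution pair.
q_in_dist = [10.0, 10.0, 10.0, 10.0]
q_ood = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(ensemble_uncertainty(q_in_dist))  # 0.0
print(ensemble_uncertainty(q_ood))      # 2.0
```

In a real implementation the K values would come from K Q-networks evaluated on a batch, but the per-pair arithmetic is the same.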

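3.2 pessimistic learning

idea: penalize the Q function based on uncertainty.
The PBRL loss function has two parts: ① the TD error on ID (in-distribution) data, and ② a pseudo TD error on OOD data.
① TD error on ID data, see equation (4); roughly \(\hat T^{in}Q^k(s,a):=r+\gamma \hat E\big[Q^k(s',a')-\beta_{in}U(s',a')\big]\), which penalizes the uncertainty of the (s',a') being transitioned to.
- (The ID tuple (s, a, r, s', a') above comes from the offline dataset.)
② Pseudo TD error on OOD data: the state \(s^{ood}\) appears to be an ID state, and the action \(a^{ood}\) is generated by the policy (so it may be OOD).
- Idea of the penalty: \(\hat T^{ood}Q^k(s^{ood},a^{ood}):=Q^k(s^{ood},a^{ood})-\beta_{ood}U(s^{ood},a^{ood})\), i.e. directly subtract its uncertainty.
- (If (s,a) is an ID state-action pair, the uncertainty will be small.)
- Related implementation details: early-stage truncation of the Q function, \(\max[0, \hat T^{ood}Q^k(s,a)]\); a large \(\beta_{ood}\) is used early in training to penalize OOD actions strongly, and \(\beta_{ood}\) is decreased continuously as training proceeds.
- (This also feels like a SARSA-style update...)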
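The two targets above can be sketched for a single sample as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the expectation over (s', a') is replaced by one sample, and the linear decay of \(\beta_{ood}\) is only one possible schedule (the notes above only say the value is decreased during training). The numbers are invented.

```python
import math

def uncertainty(q_values):
    """Population std of the K ensemble estimates for one state-action pair."""
    k = len(q_values)
    mean_q = sum(q_values) / k
    return math.sqrt(sum((q - mean_q) ** 2 for q in q_values) / k)

def in_dist_target(reward, next_q_k, next_q_all, gamma, beta_in):
    """r + gamma * (Q^k(s', a') - beta_in * U(s', a')) for one sampled transition."""
    return reward + gamma * (next_q_k - beta_in * uncertainty(next_q_all))

def ood_target(q_k, q_all, beta_ood):
    """max(0, Q^k(s_ood, a_ood) - beta_ood * U(s_ood, a_ood))."""
    return max(0.0, q_k - beta_ood * uncertainty(q_all))

def beta_ood_schedule(step, total_steps, beta_start, beta_end):
    """Hypothetical linear decay from beta_start to beta_end (an assumption)."""
    frac = min(step / total_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)

# Invented ensemble estimates with std 2.0; the k-th member predicts 5.0.
q_all = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(in_dist_target(1.0, 5.0, q_all, gamma=0.5, beta_in=0.5))  # 3.0
print(ood_target(5.0, q_all, beta_ood=1.0))                     # 3.0
print(ood_target(5.0, q_all, beta_ood=3.0))                     # 0.0 (truncated)
print(beta_ood_schedule(50, 100, beta_start=5.0, beta_end=1.0)) # 3.0
```

With a large \(\beta_{ood}\) the penalized value goes negative and the truncation clamps it to zero, which is the early-training behavior described in the implementation details.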
loss function:
\[L_{critic}=\hat E_{(s,a,r,s')\sim D_{in}}\bigg[(\hat T^{in}Q^k-Q^k)^2\bigg] + \hat E_{s^{ood}\sim D_{in},~a^{ood}\sim\pi(s^{ood})}\bigg[(\hat T^{ood}Q^k-Q^k)^2\bigg] \]
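To make the two-term structure of the critic loss concrete, here is a small sketch for one ensemble member, with both expectations replaced by batch means. It assumes the targets have already been computed and are treated as constants (in an autodiff framework one would typically stop gradients through them; that detail is an assumption here, not something stated in these notes). The function name and the batch values are hypothetical.

```python
def critic_loss(in_batch, ood_batch):
    """Mean squared ID TD error plus mean squared OOD pseudo TD error.

    in_batch:  list of (q_k, target_in) pairs for dataset samples
    ood_batch: list of (q_k, target_ood) pairs for policy-sampled actions
    """
    in_term = sum((target - q) ** 2 for q, target in in_batch) / len(in_batch)
    ood_term = sum((target - q) ** 2 for q, target in ood_batch) / len(ood_batch)
    return in_term + ood_term

# Invented (prediction, target) pairs.
in_batch = [(2.0, 3.0), (4.0, 4.0)]   # squared errors 1.0 and 0.0, mean 0.5
ood_batch = [(5.0, 3.0), (1.0, 0.0)]  # squared errors 4.0 and 1.0, mean 2.5

print(critic_loss(in_batch, ood_batch))  # 3.0
```

The same loss would be evaluated once per ensemble member k, each with its own predictions and targets.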
policy: the policy aims to maximize the Q function; specifically, it maximizes the minimum value over the Q ensemble.
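A toy sketch of the "maximize the minimum over the ensemble" idea is given below. It assumes a small discrete set of candidate actions so the maximization can be done by enumeration; with a continuous policy one would instead take gradient steps on the same min-over-ensemble objective. The names `greedy_action`, `a_in_dist`, `a_ood` and all numbers are hypothetical.

```python
def greedy_action(ensemble_q_by_action):
    """Return the action whose worst-case (minimum) ensemble Q value is largest."""
    return max(ensemble_q_by_action, key=lambda a: min(ensemble_q_by_action[a]))

# Invented ensemble estimates for one state and two candidate actions.
# The OOD action has the higher mean (5.0) but large disagreement.
ensemble_q = {
    "a_in_dist": [4.0, 4.2, 3.9],
    "a_ood": [9.0, 1.0, 5.0],
}

print(greedy_action(ensemble_q))  # a_in_dist
```

Taking the minimum means one pessimistic ensemble member is enough to veto an action, so actions the ensemble disagrees on are disfavored even when their average value looks high.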

3.3 covers the theory.