offline RL | IQL: avoiding unseen actions via SARSA-style Q updates

Paper: Offline Reinforcement Learning with Implicit Q-Learning, Sergey Levine's group, ICLR 2022, review scores 5 6 8.
PDF version: https://arxiv.org/pdf/2110.06169.pdf
HTML version: https://ar5iv.labs.arxiv.org/html/2110.06169
OpenReview: https://openreview.net/forum?id=68n2s9ZJWF8
GitHub:
- https://github.com/ikostrikov/implicit_q_learning
- https://github.com/rail-berkeley/rlkit/tree/master/examples/iql
Two related blog posts:
- https://zhuanlan.zhihu.com/p/497358947
- https://blog.csdn.net/wxc971231/article/details/128803648

0 abstract

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This tradeoff is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values.

We propose a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function, without any explicit policy. Then, we extract the policy via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions.

We dub our method Implicit Q-learning (IQL). IQL is easy to implement, computationally efficient, and only requires fitting an additional critic with an asymmetric L2 loss.

IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.

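background:
- Offline RL has to reconcile two conflicting goals: ① learn a policy that improves on the behavior policy, and ② keep the deviation from the behavior policy small, to avoid errors caused by distribution shift.
- This trade-off matters because most current offline RL methods need to query the value of unseen actions during training in order to improve the policy, so they must either constrain those actions to be in-distribution or regularize their values.
method:
- We propose a new offline RL method, Implicit Q-learning (IQL), which never evaluates actions outside the dataset, yet through generalization still lets the learned policy improve substantially over the best behavior in the data.
- main insight: instead of evaluating unseen actions from the latest policy, approximate the policy improvement step implicitly. Treat the state value function (the V function) as a random variable whose randomness is determined by the action (while still integrating over the dynamics to avoid excessive optimism), then take a state-conditional upper expectile of this random variable to estimate the value of the best action in that state.
- IQL relies on the generalization capacity of the function approximator to estimate the value of the best available action in a given state, without ever directly querying the Q-function with an unseen action.
- IQL alternates between ① fitting this upper-expectile value function and ② backing it up into a Q-function. There is no explicit policy in this loop; the policy is extracted via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions.
- IQL is easy to implement and computationally efficient, and only requires fitting one additional critic with an asymmetric L2 loss.
results:
- IQL achieves state-of-the-art performance on D4RL, a standard benchmark for offline RL.
- offline-to-online: we also show (experimentally) that IQL achieves strong fine-tuning performance with online interaction after offline initialization.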

open review

contribution:
- IQL builds on the novel idea of expectile regression: by focusing on in-sample actions it avoids querying the values of unseen actions, and its performance is stable. The paper is well written.
- In the policy improvement stage: in-sample policy evaluation + advantage-weighted regression. The Q-function is updated with a SARSA-like TD target. (SARSA: collect a transition (s, a, r, s', a') and update Q(s, a) toward r(s, a) + γQ(s', a').)
- The paper studies how to avoid using out-of-sample (unseen) actions during the Q update.
- As BCQ (an earlier offline RL method) points out, unconstrained policy extraction fails in offline RL; we therefore choose a constrained policy extraction scheme, advantage-weighted regression (AWR).
- In short: expectile regression + training the Q-function + extracting the policy with advantage-weighted behavioral cloning.
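
To make the SARSA remark concrete, here is a minimal tabular sketch contrasting a SARSA-style target (which uses the next action a' stored in the dataset) with a max-based Q-learning target (which may pick an action that never appears in the data). The Q table, the transition, and the function names are hypothetical illustrations, not taken from the paper or its code.

```python
import numpy as np

def sarsa_target(q, r, s_next, a_next, gamma=0.99):
    """TD target built from the dataset's own next action a' (no argmax)."""
    return r + gamma * q[s_next, a_next]

def q_learning_target(q, r, s_next, gamma=0.99):
    """TD target that maximizes over all actions, including unseen ones."""
    return r + gamma * q[s_next].max()

# Hypothetical table with 2 states and 3 actions. Action 2 never appears in
# the data, so its entry in state 1 is an arbitrary, overestimated value.
q = np.array([[0.0, 0.0, 0.0],
              [1.0, 2.0, 50.0]])

# One made-up transition (s, a, r, s', a') from the dataset.
s, a, r, s_next, a_next = 0, 1, 0.5, 1, 1

y_sarsa = sarsa_target(q, r, s_next, a_next, gamma=0.9)  # uses Q(s', a') only
y_max = q_learning_target(q, r, s_next, gamma=0.9)       # picks up the bogus 50.0
```

In this toy case the max-based target is dominated by the value of the unseen action, while the SARSA-style target stays within what the data supports.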
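experiments:
- Our method improves substantially over recent methods (e.g. TD3+BC, a NeurIPS Spotlight, and Onestep RL, also from NeurIPS): it improves on locomotion and achieves a 3x improvement on AntMaze. Among prior methods, CQL is the closest to ours in performance; yet on AntMaze, the hardest task, we improve over CQL by 25%.
- Note that TD3+BC brings no performance improvement (see Table 1); it is merely simple. Our work both introduces a simpler method and achieves better performance. In addition, our method improves greatly over CQL in runtime (4x faster) and in the fine-tuning experiments (2x improvement, Table 2).
strengths:
- novel idea: apply expectile regression to in-distribution (ID) actions to learn a value function based on the high-performing ID actions.
  - There is a lot of prior work on quantile regression in RL, but most of it learns a distribution over value functions where the randomness comes from the environment, and quantile regression is usually used there to improve worst-case robustness.
  - This paper instead applies expectile regression to the value function with the randomness coming from the action, and shows that it generalizes the Bellman expectation equation and the Bellman optimality equation. It is the first work to propose this.
  - The trick of using both a V and a Q function to realize this learning also seems very clever.
- Both the theory and the experimental results are good. A similar modification might also be applicable to online RL.
questions:
- Why use expectile regression rather than mean regression for the value function update? (It turned out this experiment had in fact been done; the reviewer missed it.)
- On the gym locomotion tasks IQL is comparable to or worse than CQL, which is somewhat counter-intuitive: IQL is closer to behavior cloning than CQL is, yet it does worse on the tasks with better data quality.
- Since the learned optimal Q-function is good and accurate, why not directly optimize a parametric policy (deterministic or Gaussian) against it, e.g. by maximizing E_(s,a) [Q(s,a)]? Why is behavior cloning required? After all, BC can hardly exceed the best policy in the dataset. Answer: the Q(s,a) learned by IQL is not defined for OOD actions, so unconstrained maximization of Q(s,a) may select actions whose values are erroneously overestimated.
- According to the proof, the learned value function is optimal under the data only in the limit where the expectile τ goes to 1. In the code I observed τ=0.7 for MuJoCo, τ=0.7 for Adroit, and τ=0.9 for AntMaze, none of which reaches the limit 1, so there seems to be a gap between theory and implementation.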

I recommend reading this blog post directly…

https://zhuanlan.zhihu.com/p/497358947 , which is already very well written.

https://blog.csdn.net/wxc971231/article/details/128803648 , which can be read alongside it.

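Blog sections 1.1 and 1.2 correspond to the related work in section 2.
1.3 corresponds to section 3 (preliminaries) plus the introduction of section 4 (before 4.1).
1.4 explains expectile regression, an asymmetric L2 loss, used to minimize \(L(θ)=\mathbb E_{(s,a,r,s',a')\sim D}\big[L_2^\tau\big(r+γQ_{\hat θ}(s',a')-Q_θ(s,a)\big)\big]\) .
- expectile regression: \(L_2^\tau(u)=|\tau-\mathbb 1(u<0)|\,u^2\) , applied to the residual \(u=y_i-\hat y_i\) and summed over samples. (The similar-looking \(\sum\max[\tau(y_i-\hat y_i),(\tau-1)(y_i-\hat y_i)]\) is the pinball loss of quantile regression, an asymmetric L1 loss, not the expectile loss.)
- So this loss could in principle be replaced by MSE or by quantile regression (which is likewise defined by a loss function), but the authors claim that expectile regression works best.
- There seems to be a proof based on expectile regression later in the paper (?)
- With τ = 0.5 the loss reduces to MSE (up to a constant factor of 0.5); the closer τ is to 1, the more weight positive TD errors receive, so the fit is pulled toward transitions with larger TD targets and the Q estimate approaches its upper bound on the dataset; as τ → 1, the result can be regarded as Q* .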
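
As a small illustration of the asymmetric L2 loss described above, the sketch below evaluates the expectile loss on a few residuals and puts the quantile (pinball) loss next to it for comparison. The function names and the sample residuals are made up for this example.

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric L2 loss |tau - 1(u < 0)| * u**2, elementwise."""
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

def pinball_loss(u, tau):
    """Quantile (pinball) loss max(tau * u, (tau - 1) * u), for comparison."""
    return np.maximum(tau * u, (tau - 1.0) * u)

u = np.array([-2.0, -1.0, 1.0, 2.0])  # residuals y - y_hat (made-up values)

loss_mid = expectile_loss(u, 0.5)     # symmetric case: equals 0.5 * u**2
loss_hi = expectile_loss(u, 0.9)      # positive residuals weigh 9x more than negative ones
loss_pin = pinball_loss(u, 0.9)       # linear, not quadratic, in |u|
```

With τ = 0.9, underestimating a target (positive residual) costs nine times as much as overestimating it by the same amount, which is what pushes the fit toward the upper part of the target distribution.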
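2.1 corresponds to section 4.2 of the paper.
- The problem with using the expectile loss above directly is that it also picks up the environment randomness \(s'\sim p(·|s,a)\) : a large TD target may simply come from a "lucky" next state that happened to be sampled, and even if its probability is small, expectile regression will latch onto it, leading to overestimated Q values.
- To address this, IQL additionally learns a separate state value function, with the loss \(L_V(\psi)=E_{(s,a)\sim D}[L_2^\tau(Q_{\hat \theta}(s,a) -V_\psi(s))]\) .
  - \(\hat θ\) also appeared in the loss for θ above; it is presumably the target Q, introduced to keep Q from drifting.
  - In short, this learns a value function in which the action distribution for each state is the one given by the dataset D.
- This state value function is then used to update the Q-function, so that an action is not misjudged as good because of a random "lucky" next state.
  - \(L_Q(\theta)=E_{(s,a,s')\sim D}\big[r(s,a)+γV_\psi(s')-Q_θ(s,a)\big]^2\) .
  - This loss is an MSE, not an expectile loss.
  - (It is not obvious at first why this avoids the random "lucky state" problem.)
- Clipped Double Q-Learning is used to mitigate overestimation of the Q values.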
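
The two losses above can be sketched on plain arrays as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: taking the minimum of two target Q estimates inside the V loss is one assumed way of applying Clipped Double Q-Learning, the `done` mask for terminal states is an addition that does not appear in the formulas above, and all batch values are made up.

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric L2 loss |tau - 1(u < 0)| * u**2, elementwise."""
    return np.where(u < 0, 1.0 - tau, tau) * u ** 2

def value_loss(q1_target, q2_target, v, tau=0.7):
    """L_V: expectile regression of V(s) toward target Q(s, a) on dataset actions."""
    q_target = np.minimum(q1_target, q2_target)  # clipped double Q (assumed placement)
    return expectile_loss(q_target - v, tau).mean()

def q_loss(r, v_next, q, done, gamma=0.99):
    """L_Q: plain MSE toward r + gamma * V(s'); no next action is queried."""
    target = r + gamma * (1.0 - done) * v_next   # done mask is an assumption
    return ((target - q) ** 2).mean()

# Hypothetical batch of two transitions.
q1_t = np.array([1.0, 3.0])    # target Q network 1 at (s, a)
q2_t = np.array([1.5, 2.0])    # target Q network 2 at (s, a)
v = np.array([1.0, 1.0])       # V(s)
r = np.array([0.0, 1.0])
v_next = np.array([2.0, 5.0])  # V(s')
q = np.array([0.0, 1.0])       # online Q(s, a)
done = np.array([0.0, 1.0])

l_v = value_loss(q1_t, q2_t, v, tau=0.7)
l_q = q_loss(r, v_next, q, done, gamma=0.5)
```

Note that neither function ever needs an action other than the one stored in the batch, which is the point of the in-sample design.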
2.2 and 2.3 correspond to section 4.3 of the paper.
- Advantage-Weighted Regression: obtain the policy from the dataset + the value functions.
- The AWR objective (maximized over φ): \(L_\pi(\phi)=E_{(s,a)\sim D}[\exp[β(Q_{\hat θ}(s,a)-V_\psi(s))]\log \pi_\phi(a|s)]\) .
- If β = 0 it reduces exactly to behavior cloning; as β grows it becomes weighted behavior cloning, with weight exp(β · advantage).
- IQL algorithm: first update \(L_V(\psi)\) (an expectile regression loss), then update \(L_Q(θ)\) (an MSE loss), then set \(\hat θ\leftarrow(1-α)\hat θ+αθ\) ; iterating this yields the value functions Q and V, and finally the policy is extracted with AWR.
- Restated for reference:
- \(L_Q(\theta)=E_{(s,a,s')\sim D}\big[r(s,a)+γV_\psi(s')-Q_θ(s,a)\big]^2\) .
- \(L_V(\psi)=E_{(s,a)\sim D}[L_2^\tau(Q_{\hat \theta}(s,a) -V_\psi(s))]\) .
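- Avoiding random "lucky states": \(L_Q\) is an MSE, so \(Q_θ(s,a)\) fits the average of \(r+γV_\psi(s')\) over \(s'\sim p(·|s,a)\) and a single lucky next state is averaged out; the upper expectile in \(L_V\) is taken only over the actions in the dataset, not over the dynamics.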
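
The policy extraction step and the target update can be sketched as below. This is a minimal sketch under assumptions: the default `beta`, the weight clipping at `max_weight`, and the default `alpha` are hypothetical choices for illustration and do not come from the notes above; the loss is written with a minus sign so that minimizing it maximizes the AWR objective.

```python
import numpy as np

def awr_weights(q, v, beta=3.0, max_weight=100.0):
    """exp(beta * advantage), clipped for numerical stability (clipping is an assumption)."""
    return np.minimum(np.exp(beta * (q - v)), max_weight)

def policy_loss(log_prob, q, v, beta=3.0):
    """Negative AWR objective: weighted log-likelihood of dataset actions."""
    return -(awr_weights(q, v, beta) * log_prob).mean()

def soft_update(target, online, alpha=0.005):
    """Target update: theta_hat <- (1 - alpha) * theta_hat + alpha * theta."""
    return (1.0 - alpha) * target + alpha * online

# Hypothetical batch: target Q(s, a), V(s), and log pi(a|s) for two dataset pairs.
q = np.array([1.0, 2.0])
v = np.array([1.0, 1.0])
log_prob = np.array([-1.0, -2.0])

w_bc = awr_weights(q, v, beta=0.0)            # all ones: plain behavior cloning
w_adv = awr_weights(q, v, beta=1.0)           # second pair is up-weighted by exp(1)
loss_bc = policy_loss(log_prob, q, v, beta=0.0)
new_target = soft_update(np.array([0.0]), np.array([1.0]), alpha=0.1)
```

Only log-probabilities of actions stored in the dataset are used, so this step does not query out-of-sample actions either.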
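section 4.4 appears to contain the mathematical proofs.
- We prove that, under certain assumptions, the method does approximate the optimal state-action value Q* .
- It can be shown that as τ → 1 (τ being the expectile parameter of the expectile regression), \(V_τ(s) → \max_{a:\,\pi_\beta(a|s)>0}Q^*(s,a)\) , i.e. V reaches the Q value of the best action that appears in the dataset.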
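
The limit above can be checked numerically on a toy case. The sketch below computes the τ-expectile of a handful of made-up Q values, standing in for Q(s, a) over the dataset actions at one state, using a simple fixed-point iteration on the first-order condition of the expectile loss. The solver and the numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def expectile(x, tau, iters=100):
    """tau-expectile of samples x: the m minimizing sum |tau - 1(x < m)| * (x - m)**2.

    The first-order condition says m is a weighted mean of x, with weight tau
    on samples above m and (1 - tau) on the rest; iterate that to a fixed point.
    """
    m = x.mean()
    for _ in range(iters):
        w = np.where(x > m, tau, 1.0 - tau)
        m = (w * x).sum() / w.sum()
    return m

# Hypothetical Q values of three equally frequent dataset actions in one state.
q_values = np.array([0.0, 1.0, 4.0])

v_mean = expectile(q_values, 0.5)    # tau = 0.5 recovers the plain mean
v_07 = expectile(q_values, 0.7)
v_09 = expectile(q_values, 0.9)
v_lim = expectile(q_values, 0.999)   # close to max(q_values), never above it
```

On this example the expectile increases with τ and approaches the largest in-sample value, which also illustrates the reviewer's point that τ = 0.7 or 0.9 stops short of the limit.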
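section 5 experiments:
- Stronger ability to stitch suboptimal trajectories: outperforms DT and one-step methods.
- No use of unseen actions, which mitigates distribution shift: outperforms policy-constraint methods such as TD3+BC and CQL.
- IQL also has an advantage in training time.
- online fine-tuning: AWAC was reportedly proposed specifically for online fine-tuning. Training offline first and then online for 100k steps, IQL performs best, CQL second, and AWAC worst because its offline initialization is poor.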