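Overview
- Paper title: Reinforcement Learning with Deep Energy-Based Policies
What problem does it solve?
The authors propose an energy-based reinforcement learning algorithm for problems with continuous state and action spaces, which they call Soft Q-Learning. Its main benefits are robustness and the transfer of skills between tasks.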
Background
Earlier methods use a stochastic policy mainly to add some exploration, for example by injecting noise or by initializing training with a high-entropy policy. Sometimes, however, we genuinely want to learn stochastic behaviors (they are more robust; see the extended reading at the end of this post).
Can such a stochastic policy be an optimal policy? If we consider the connection between optimal control and probabilistic inference, a stochastic policy emerges as the optimal answer (Todorov, 2008).
- Reference: Todorov, E. General duality between optimal control and estimation. In IEEE Conf. on Decision and Control, pp. 4286–4292. IEEE, 2008.
- Reference: Toussaint, M. Robot trajectory optimization using approximate inference. In Int. Conf. on Machine Learning, pp. 1049–1056. ACM, 2009.
Intuitively, framing control as inference produces policies that aim not only at the single deterministic lowest-cost behavior but at the whole range of low-cost behaviors. Instead of learning the best way to perform the task, the resulting policies try to learn all of the ways of performing the task. In other words, we want to find every "optimal solution" of the problem.
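Such a policy is useful in several ways. It can serve as the initialization for a harder problem: for example, train a model that makes a robot walk forward, then use that model's parameters to initialize training for jumping or running. It is a better exploration mechanism for seeking out the best mode in a multi-modal reward landscape. And because more behaviors are available to choose from, it is more robust when handling perturbations.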
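There is prior work on stochastic policies (see the references at the end of this post), but most of it is hard to apply to high-dimensional continuous action spaces, or it is restricted to simple Gaussian policy distributions, which is very limited. Can we instead learn a policy that represents an arbitrary distribution?
The authors propose an energy-based model (EBM) approach in which the energy function is given by the soft Q-function.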
What method is used?
Maximum Entropy Reinforcement Learning
The optimization objective of standard reinforcement learning is:
$$\pi_{\mathrm{std}}^{*}=\arg\max_{\pi} \sum_{t} \mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \rho_{\pi}}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$$
The optimization objective of maximum entropy RL is:
$$\pi_{\mathrm{MaxEnt}}^{*}=\arg\max_{\pi} \sum_{t} \mathbb{E}_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \rho_{\pi}}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \mathbf{s}_{t}\right)\right)\right]$$
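where $\alpha$ is the coefficient that weighs entropy against reward. Unlike Boltzmann exploration and PGQ, which greedily maximize entropy at the current time step only, the maximum entropy objective increases the entropy of the policy distribution along the entire trajectory.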
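To make the role of $\alpha$ concrete, here is a minimal sketch for a one-step problem with three discrete actions. The reward values, the two candidate policies, and the function names are illustrative assumptions, not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maxent_objective(probs, rewards, alpha):
    """Expected one-step reward plus alpha times the policy entropy."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + alpha * entropy(probs)

# Assumed toy rewards: two equally good actions and one bad action.
rewards = [1.0, 1.0, 0.0]
deterministic = [1.0, 0.0, 0.0]   # commits to one of the best actions
spread = [0.5, 0.5, 0.0]          # covers both best actions

# alpha = 0 recovers the standard objective: both policies score the same.
std_det = maxent_objective(deterministic, rewards, alpha=0.0)
std_spread = maxent_objective(spread, rewards, alpha=0.0)

# alpha > 0 prefers the policy that keeps both good actions available.
soft_det = maxent_objective(deterministic, rewards, alpha=0.5)
soft_spread = maxent_objective(spread, rewards, alpha=0.5)
```

Under the standard objective the two policies are indistinguishable, while the entropy bonus breaks the tie in favour of the policy that covers every good action.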
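Soft Value Functions and Energy-Based Models
Traditional RL methods usually produce a unimodal policy distribution over actions (the left panel of the figure in the original post). Since we want to cover the whole action distribution, a natural idea is to exponentiate the Q-function, which yields a multimodal policy distribution.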
- Relationship between the energy-based model and the soft Q-function:
The authors therefore adopt an energy-based policy of the following form:
$$\pi\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) \propto \exp \left(-\mathcal{E}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$$
where $\mathcal{E}$ is the energy function, which can be represented by a neural network.
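As an illustration of how an energy function induces a multimodal policy, the sketch below normalises $\exp(-\mathcal{E})$ on a grid for a single fixed state. The double-well energy, the grid range, and the function names are hypothetical choices made for this example; the paper uses a neural network for $\mathcal{E}$.

```python
import math

def energy(action):
    """Hypothetical energy with two equally deep wells at a = -1 and a = +1."""
    return (action ** 2 - 1.0) ** 2 / 0.1

def boltzmann_density(actions):
    """Normalise exp(-E(a)) on an evenly spaced grid so it integrates to 1."""
    step = actions[1] - actions[0]
    weights = [math.exp(-energy(a)) for a in actions]
    normaliser = sum(weights) * step
    return [w / normaliser for w in weights]

# Assumed 1-D action grid on [-2, 2] with spacing 0.01.
actions = [-2.0 + 0.01 * i for i in range(401)]
density = boltzmann_density(actions)

# Indices 100, 200 and 300 correspond to a = -1, 0 and +1.
peak_left, valley, peak_right = density[100], density[200], density[300]
```

The density has two separate peaks, one per low-energy well, which a single Gaussian policy could not represent.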
Theorem 1. Let the soft Q-function be defined by:
$$Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r_{t}+\mathbb{E}_{\left(\mathbf{s}_{t+1}, \ldots\right) \sim \rho_{\pi}}\left[\sum_{l=1}^{\infty} \gamma^{l}\left(r_{t+l}+\alpha \mathcal{H}\left(\pi_{\mathrm{MaxEnt}}^{*}\left(\cdot \mid \mathbf{s}_{t+l}\right)\right)\right)\right]$$
and the soft value function by:
$$V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)=\alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}$$
Then the optimal policy for the maximum entropy RL objective stated earlier is:
$$\pi_{\mathrm{MaxEnt}}^{*}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)=\exp \left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)\right)\right)$$
The policy improvement proof for Soft Q-Learning partly explains this formula: the optimal policy must take this energy-based form.
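The following sketch checks this form numerically for a single state with three discrete actions, so the integral becomes a sum. The Q-values, the temperature, and the helper names are assumptions for illustration. For fixed Q-values, the policy $\exp((Q - V)/\alpha)$ should be normalised, should attain $\mathbb{E}_{\pi}[Q] + \alpha \mathcal{H}(\pi) = V$, and should score higher than other policies.

```python
import math

def soft_value(q_values, alpha):
    """alpha * log(sum(exp(q / alpha))), computed stably by subtracting the max."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def maxent_policy(q_values, alpha):
    """pi(a) = exp((Q(a) - V) / alpha) for a discrete action set."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

def entropy_regularised_value(probs, q_values, alpha):
    """E_pi[Q] + alpha * H(pi) for a discrete policy."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return expected_q + alpha * entropy

# Assumed Q-values for one state and an assumed temperature.
q_values = [1.0, 2.0, 3.0]
alpha = 0.5

policy = maxent_policy(q_values, alpha)
best = entropy_regularised_value(policy, q_values, alpha)
greedy = entropy_regularised_value([0.0, 0.0, 1.0], q_values, alpha)
uniform = entropy_regularised_value([1 / 3, 1 / 3, 1 / 3], q_values, alpha)
```

Both the greedy and the uniform policy score below the energy-based policy here, which is consistent with the theorem in this one-state setting.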
Theorem 1 connects the maximum entropy objective with energy-based models: $\frac{1}{\alpha} Q_{\mathrm{soft}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ acts as the negative energy, and $\frac{1}{\alpha} V_{\mathrm{soft}}\left(\mathbf{s}_{t}\right)$ serves as the log-partition function.
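As a short worked check of the log-partition role, the normalisation of the optimal policy follows directly from the definition of the soft value function, which gives $\int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q_{\mathrm{soft}}^{*}\right) d\mathbf{a}^{\prime} = \exp\left(\frac{1}{\alpha} V_{\mathrm{soft}}^{*}\right)$:

$$
\begin{aligned}
\int_{\mathcal{A}} \pi_{\mathrm{MaxEnt}}^{*}\left(\mathbf{a} \mid \mathbf{s}_{t}\right) d \mathbf{a}
&=\int_{\mathcal{A}} \exp \left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}\right)-V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)\right)\right) d \mathbf{a} \\
&=\exp \left(-\frac{1}{\alpha} V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)\right) \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}\right)\right) d \mathbf{a} \\
&=\exp \left(-\frac{1}{\alpha} V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)\right) \exp \left(\frac{1}{\alpha} V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}\right)\right)=1
\end{aligned}
$$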
The soft Q-function satisfies the soft Bellman equation:
$$Q_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r_{t}+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}^{*}\left(\mathbf{s}_{t+1}\right)\right]$$
With these basic definitions in place, what remains is to adapt Q-learning to the maximum entropy policy.
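One way to read the soft value function is as a smooth maximum over actions. The sketch below assumes a discrete action set, so the integral becomes a sum, and uses made-up Q-values; it compares a small and a large temperature.

```python
import math

def soft_value(q_values, alpha):
    """alpha * log(sum(exp(q / alpha))), computed stably by subtracting the max."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

# Assumed Q-values of three actions in one state.
q_values = [1.0, 2.0, 3.0]

# With a small temperature the soft value approaches the hard maximum.
v_small_alpha = soft_value(q_values, alpha=0.01)

# With alpha = 1 it sits strictly above the maximum, reflecting the entropy bonus.
v_unit_alpha = soft_value(q_values, alpha=1.0)
```

As $\alpha \to 0$ the soft Bellman equation therefore reduces to the familiar hard-max Bellman equation of standard Q-learning.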
Training Expressive Energy-Based Models via Soft Q-Learning
A contraction-mapping argument shows that the fixed-point iteration
$$\begin{aligned} Q_{\mathrm{soft}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) & \leftarrow r_{t}+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}\left(\mathbf{s}_{t+1}\right)\right], \quad \forall \mathbf{s}_{t}, \mathbf{a}_{t} \\ V_{\mathrm{soft}}\left(\mathbf{s}_{t}\right) & \leftarrow \alpha \log \int_{\mathcal{A}} \exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right) d \mathbf{a}^{\prime}, \quad \forall \mathbf{s}_{t} \end{aligned}$$
converges to $Q_{\mathrm{soft}}^{*}$ and $V_{\mathrm{soft}}^{*}$. Several issues still need to be addressed, such as how to scale this to large state and action spaces. In addition, sampling from the energy-based policy is intractable in general.
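For intuition, the fixed-point iteration can be run exactly on a tiny tabular problem. This is a minimal sketch, assuming a hypothetical deterministic MDP with two states and two actions (the reward table, transition table, and function names are made up); with discrete actions the integral is a sum.

```python
import math

def soft_value(q_row, alpha):
    """Soft value of one state: alpha * log(sum(exp(q / alpha))) over its actions."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))

def soft_backup(q, rewards, next_state, gamma, alpha):
    """One sweep of the soft Bellman backup over every state-action pair."""
    v = [soft_value(row, alpha) for row in q]
    return [[rewards[s][a] + gamma * v[next_state[s][a]]
             for a in range(len(q[s]))] for s in range(len(q))]

def soft_q_iteration(rewards, next_state, gamma, alpha, sweeps=1000):
    """Repeat the backup from an all-zero table."""
    q = [[0.0] * len(row) for row in rewards]
    for _ in range(sweeps):
        q = soft_backup(q, rewards, next_state, gamma, alpha)
    return q

def bellman_residual(q, rewards, next_state, gamma, alpha):
    """Largest change produced by one more backup; near zero at the fixed point."""
    backed_up = soft_backup(q, rewards, next_state, gamma, alpha)
    return max(abs(backed_up[s][a] - q[s][a])
               for s in range(len(q)) for a in range(len(q[s])))

# Assumed toy MDP: rewards[s][a] and deterministic next_state[s][a].
rewards = [[0.0, 1.0], [2.0, 0.0]]
next_state = [[0, 1], [0, 1]]

q_star = soft_q_iteration(rewards, next_state, gamma=0.9, alpha=1.0)
residual = bellman_residual(q_star, rewards, next_state, gamma=0.9, alpha=1.0)
```

The residual shrinks by roughly a factor of $\gamma$ per sweep, which is the contraction property in action. The table and the exact sum are what stop being practical in large or continuous spaces.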
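Soft Q Learning
Even though the soft Bellman backup is proven to converge, computing $V_{\mathrm{soft}}^{*}$ involves an integral, so it remains hard to handle. The authors represent the soft Q-function with a function approximator $Q_{\mathrm{soft}}^{\theta}(\mathbf{s}, \mathbf{a})$ with parameters $\theta$.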
First, to optimize the expressions above with stochastic optimization, we rewrite the soft value function as an expectation via importance sampling:
$$V_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}\right)=\alpha \log \mathbb{E}_{q_{\mathbf{a}^{\prime}}}\left[\frac{\exp \left(\frac{1}{\alpha} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\right)}{q_{\mathbf{a}^{\prime}}\left(\mathbf{a}^{\prime}\right)}\right]$$
where $q_{\mathbf{a}^{\prime}}$ can be an arbitrary distribution over the action space. Soft Q-iteration can then be expressed as minimizing:
$$J_{Q}(\theta)=\mathbb{E}_{\mathbf{s}_{t} \sim q_{\mathbf{s}_{t}}, \mathbf{a}_{t} \sim q_{\mathbf{a}_{t}}}\left[\frac{1}{2}\left(\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$$
where $\hat{Q}_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r_{t}+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p_{\mathbf{s}}}\left[V_{\mathrm{soft}}^{\bar{\theta}}\left(\mathbf{s}_{t+1}\right)\right]$ is the target Q-value, computed with the target parameters $\bar{\theta}$ in place of $\theta$.
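The two formulas above can be combined into a small Monte Carlo sketch. Everything concrete here is an assumption for illustration: a scalar state and action, a bounded action space $[-3, 3]$, a uniform proposal $q_{\mathbf{a}^{\prime}}$, a hypothetical quadratic Q-function with a single scalar parameter, and a hand-made batch of transitions. The paper itself uses neural networks.

```python
import math
import random

ACTION_LOW, ACTION_HIGH = -3.0, 3.0   # assumed bounded 1-D action space

def q_soft(theta, state, action):
    """Hypothetical Q-function: a quadratic bump centred at theta * state."""
    return -(action - theta * state) ** 2

def soft_value_is(theta, state, alpha, n_samples, rng):
    """Importance-sampled soft value with a uniform proposal over the actions."""
    proposal_density = 1.0 / (ACTION_HIGH - ACTION_LOW)
    total = 0.0
    for _ in range(n_samples):
        a = rng.uniform(ACTION_LOW, ACTION_HIGH)
        total += math.exp(q_soft(theta, state, a) / alpha) / proposal_density
    return alpha * math.log(total / n_samples)

def soft_q_loss(theta, theta_bar, batch, gamma, alpha, n_samples, rng):
    """Monte Carlo estimate of J_Q over a batch of (s, a, r, s') transitions."""
    total = 0.0
    for s, a, r, s_next in batch:
        # The target uses the frozen target parameters theta_bar.
        target = r + gamma * soft_value_is(theta_bar, s_next, alpha, n_samples, rng)
        total += 0.5 * (target - q_soft(theta, s, a)) ** 2
    return total / len(batch)

rng = random.Random(0)

# With theta = 0 the integral has a closed form on [-3, 3]: sqrt(pi) * erf(3).
estimate = soft_value_is(theta=0.0, state=1.0, alpha=1.0, n_samples=20000, rng=rng)
exact = math.log(math.sqrt(math.pi) * math.erf(3.0))

# Assumed toy batch of transitions (s, a, r, s').
batch = [(0.5, 0.2, 1.0, 0.4), (-1.0, 0.0, 0.0, 0.3)]
loss = soft_q_loss(theta=0.5, theta_bar=0.5, batch=batch,
                   gamma=0.9, alpha=1.0, n_samples=2000, rng=rng)
```

A uniform proposal is the simplest choice but becomes wasteful when the Q-function is sharply peaked. A proposal that concentrates on high-Q actions gives lower-variance estimates, which is one reason to care about sampling from the current policy.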
Approximate Sampling and Stein Variational Gradient Descent (SVGD)
How do we sample from the policy defined by the soft Q-function? There are traditionally two strategies for sampling from an energy-based distribution: 1. use Markov chain Monte Carlo (MCMC) based sampling; 2. learn a stochastic sampling network trained to output approximate samples from the target distribution. The authors follow the second strategy and use a sampling network based on Stein variational gradient descent (SVGD) and amortized SVGD, two methods proposed in 2016 (Liu & Wang, 2016; Wang & Liu, 2016).
- Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2370–2378, 2016.
- Wang, D. and Liu, Q. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
This has three main advantages: the network provides a stochastic sample generator; it can be shown to converge to an accurate estimate of the posterior distribution of the EBM; and the resulting algorithm can be related to actor-critic methods, which later led to SAC.
We want to learn a state-conditioned stochastic neural network $\mathbf{a}_{t}=f^{\phi}\left(\xi ; \mathbf{s}_{t}\right)$, where $\phi$ denotes the network parameters and $\xi$ is noise drawn from a Gaussian or any other distribution. We look for parameters $\phi$ such that the induced action distribution $\pi^{\phi}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$ approximates the energy-based distribution, as measured by the KL divergence:
$$J_{\pi}\left(\phi ; \mathbf{s}_{t}\right)=D_{\mathrm{KL}}\left(\pi^{\phi}\left(\cdot \mid \mathbf{s}_{t}\right) \,\Big\|\, \exp \left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \cdot\right)-V_{\mathrm{soft}}^{\theta}\right)\right)\right)$$
Stein variational gradient descent gives the following update direction:
$$\Delta f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)=\mathbb{E}_{\mathbf{a}_{t} \sim \pi^{\phi}}\left[\kappa\left(\mathbf{a}_{t}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right) \nabla_{\mathbf{a}^{\prime}} Q_{\mathrm{soft}}^{\theta}\left(\mathbf{s}_{t}, \mathbf{a}^{\prime}\right)\Big|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}+\alpha \nabla_{\mathbf{a}^{\prime}} \kappa\left(\mathbf{a}^{\prime}, f^{\phi}\left(\cdot ; \mathbf{s}_{t}\right)\right)\Big|_{\mathbf{a}^{\prime}=\mathbf{a}_{t}}\right]$$
where $\kappa$ is a kernel function and $\Delta f^{\phi}$ is the optimal direction in the reproducing kernel Hilbert space of $\kappa$. Applying the chain rule to backpropagate the Stein variational gradient into the policy network gives:
$$\frac{\partial J_{\pi}\left(\phi ; \mathbf{s}_{t}\right)}{\partial \phi} \propto \mathbb{E}_{\xi}\left[\Delta f^{\phi}\left(\xi ; \mathbf{s}_{t}\right) \frac{\partial f^{\phi}\left(\xi ; \mathbf{s}_{t}\right)}{\partial \phi}\right]$$
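To show what the direction $\Delta f^{\phi}$ does, the sketch below applies it directly to a set of action particles for one fixed state, without the amortizing network $f^{\phi}$. It is a simplification under stated assumptions: a 1-D action, a hypothetical quadratic soft Q-function (so the target density is a Gaussian centred at 2), an RBF kernel with a fixed bandwidth, and a hand-picked step size.

```python
import math

def grad_q(action):
    """Gradient of a hypothetical soft Q-function Q(a) = -(a - 2)^2 / 2."""
    return -(action - 2.0)

def rbf(a, b, bandwidth):
    """RBF kernel kappa(a, b) with a fixed bandwidth."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def svgd_direction(particles, alpha, bandwidth):
    """Empirical Stein direction for every particle a_i, averaging over all a_j."""
    n = len(particles)
    directions = []
    for a_i in particles:
        total = 0.0
        for a_j in particles:
            k = rbf(a_j, a_i, bandwidth)
            # Attraction: kernel-weighted gradient of Q pulls towards high Q.
            attract = k * grad_q(a_j)
            # Repulsion: alpha * d kappa(a_j, a_i) / d a_j pushes particles apart.
            repel = alpha * (-(a_j - a_i) / bandwidth ** 2) * k
            total += attract + repel
        directions.append(total / n)
    return directions

def run_svgd(particles, alpha=1.0, bandwidth=1.0, step=0.1, iters=500):
    """Move the particles along the Stein direction for a fixed number of steps."""
    particles = list(particles)
    for _ in range(iters):
        deltas = svgd_direction(particles, alpha, bandwidth)
        particles = [a + step * d for a, d in zip(particles, deltas)]
    return particles

# Assumed initial particles: 11 points evenly spaced on [-1, 1].
initial = [-1.0 + 0.2 * i for i in range(11)]
final = run_svgd(initial)
mean_final = sum(final) / len(final)
spread_final = max(final) - min(final)
```

The first term moves particles towards high soft Q-values, while the kernel-gradient term keeps them from collapsing onto a single mode. In the amortized setting this direction is not applied to particles directly but backpropagated into $\phi$ through $\partial f^{\phi} / \partial \phi$.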
What results were achieved?
Publication and author information
This paper was published at ICML 2017. The first author, Tuomas Haarnoja, is a research scientist at Google DeepMind.
Reference links
- https://zhuanlan.zhihu.com/p/70360272
- https://zh.wikipedia.org/wiki/%E7%8E%BB%E5%B0%94%E5%85%B9%E6%9B%BC%E5%88%86%E5%B8%83
- https://zhuanlan.zhihu.com/p/44783057
- https://zhuanlan.zhihu.com/p/76681229
- https://www.dazhuanlan.com/2019/11/30/5de17e0ec54b1/
- Code: https://github.com/haarnoja
Extended reading
Why use a stochastic policy
In some cases we need to learn a stochastic policy. Why would we want that? The authors give two groups of reasons:
- Exploration in the presence of multimodal objectives, and compositionality attained via pretraining (Daniel et al., 2012).
- Robustness in the face of uncertain dynamics (Ziebart, 2010), imitation learning (Ziebart et al., 2008), and improved convergence and computational properties (Gu et al., 2016a).
- Reference 1: Daniel, C., Neumann, G., and Peters, J. Hierarchical relative entropy policy search. In AISTATS, pp. 273–281, 2012.
- Reference 2: Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, 2010.
- Reference 3: Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.
- Reference 4: Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016a.
Prior work on maximum entropy stochastic policies
- Z-learning (Todorov, 2007): Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376. MIT Press, 2007.
- Maximum entropy inverse RL (Ziebart et al., 2008): Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.
- Approximate inference using message passing (Toussaint, 2009): Toussaint, M. Robot trajectory optimization using approximate inference. In Int. Conf. on Machine Learning, pp. 1049–1056. ACM, 2009.
- $\Psi$-learning (Rawlik et al., 2012): Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Proceedings of Robotics: Science and Systems VIII, 2012.
- G-learning (Fox et al., 2016): Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conf. on Uncertainty in Artificial Intelligence, 2016.
- PGQ (O'Donoghue et al., 2016), one of the recent proposals in deep RL: O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.