Overview
1. Policy Function Approximation
- Policy Network $\pi(a\mid s;\theta)$: use $\pi(a\mid s;\theta)$ to approximate the policy function $\pi(a\mid s)$.
Using a Softmax output layer ensures $\sum_{a\in\mathcal{A}}\pi(a\mid s;\theta)=1$.
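A minimal sketch of such a policy network for a discrete action space, assuming PyTorch; the class name, layer sizes, and hidden width are illustrative, not from the original notes. The later sketches in these notes reuse this `PolicyNetwork`.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Policy network pi(a|s; theta): maps a state to a distribution over actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        # Softmax makes the outputs non-negative and sum to 1 over the action set A.
        return torch.softmax(logits, dim=-1)
```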
2. State-Value Function Approximation
Action-Value function: $Q_\pi(s_t,a_t)=E[U_t\mid S_t=s_t,A_t=a_t]$
State-Value function: $V_\pi(s_t)=E_A[Q_\pi(s_t,A)]$
- Policy-Based Reinforcement Learning
  $$V_\pi(s_t)=E_A[Q_\pi(s_t,A)]=\sum_a\pi(a\mid s_t)\cdot Q_\pi(s_t,a)$$
After approximating the policy function $\pi(a_t\mid s_t)$ with the Policy Network, the state-value function can be approximated as
$$V_\pi(s_t;\theta)=\sum_a\pi(a\mid s_t;\theta)\cdot Q_\pi(s_t,a)$$
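As a quick illustration, a minimal sketch of this weighted sum, assuming the `PolicyNetwork` above and a placeholder vector `q_values` holding estimates of $Q_\pi(s_t,a)$ for each action (in practice these values would themselves have to be estimated):

```python
import torch

def approx_state_value(policy_net, state, q_values):
    """V_pi(s; theta) = sum_a pi(a|s; theta) * Q_pi(s, a)."""
    probs = policy_net(state)          # pi(.|s; theta)
    return (probs * q_values).sum()    # probability-weighted average of action values
```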
Learning objective: improve $\theta$ so that $V_\pi(s;\theta)$ becomes larger. The objective function can be defined as
$$\text{maximize}\quad J(\theta)=E_S[V(S;\theta)]$$
Parameter update: policy gradient ascent (a minimal sketch follows this list)
- Observe the state $s$
- Update the parameters $\theta$: $\theta\leftarrow\theta+\beta\cdot\frac{\partial V(s;\theta)}{\partial\theta}$
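A minimal sketch of one such ascent step, assuming `v` is a differentiable scalar estimate of $V(s;\theta)$ computed from the current parameters (e.g. with `approx_state_value` above) and `beta` is the learning rate:

```python
import torch

def gradient_ascent_step(policy_net, v, beta=1e-3):
    """theta <- theta + beta * dV(s;theta)/dtheta."""
    grads = torch.autograd.grad(v, list(policy_net.parameters()))
    with torch.no_grad():
        for param, grad in zip(policy_net.parameters(), grads):
            param += beta * grad   # ascend, since we want V(s;theta) to increase
    return policy_net
```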
3. Policy Gradient
$$V(s;\theta)=\sum_a\pi(a\mid s;\theta)\cdot Q_\pi(s,a)$$
$$
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial\theta}
&=\frac{\partial\sum_a\pi(a\mid s;\theta)\cdot Q_\pi(s,a)}{\partial\theta} \\
&=\sum_a\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_\pi(s,a) \\
&=\sum_a\pi(a\mid s;\theta)\,\frac{\partial\log\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_\pi(s,a) \\
&=E_A\!\left[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_\pi(s,A)\right]
\end{aligned}
$$
(The derivation above is not fully rigorous: it treats $Q_\pi(s,a)$ as independent of $\theta$, but since $\pi$ depends on $\theta$, this assumption does not actually hold. However, the result is the same whether or not this dependence is taken into account.)
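The step from the second to the third line uses the log-derivative (chain-rule) identity, written out here assuming $\pi(a\mid s;\theta)>0$:

$$
\frac{\partial\log\pi(a\mid s;\theta)}{\partial\theta}
=\frac{1}{\pi(a\mid s;\theta)}\cdot\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}
\quad\Longrightarrow\quad
\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}
=\pi(a\mid s;\theta)\cdot\frac{\partial\log\pi(a\mid s;\theta)}{\partial\theta}
$$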
This yields two ways to compute the policy gradient (both are sketched in code after the list below).
- Method 1:
  $$\frac{\partial V(s;\theta)}{\partial\theta}=\sum_a\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$$
  For discrete actions, compute $f(a,\theta)=\frac{\partial\pi(a\mid s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$ for every action and sum the results.
- Method 2:
  $$\frac{\partial V(s;\theta)}{\partial\theta}=E_A\!\left[\frac{\partial\log\pi(A\mid s;\theta)}{\partial\theta}\cdot Q_\pi(s,A)\right]$$
  This form applies to both continuous and discrete actions. The expectation would be computed as an integral, but since $\pi$ is computed by a neural network the integral cannot be evaluated directly, so it is approximated by Monte Carlo sampling:
  - Randomly sample an action $\hat{a}$ from the current policy $\pi(\cdot\mid s;\theta)$
  - Compute $g(\hat{a},\theta)=\frac{\partial\log\pi(\hat{a}\mid s;\theta)}{\partial\theta}\cdot Q_\pi(s,\hat{a})$
  - $g(\hat{a},\theta)$ is an unbiased estimate of $\frac{\partial V(s;\theta)}{\partial\theta}$ and is used as an approximation of the policy gradient
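Minimal sketches of both computations, assuming the `PolicyNetwork` above, a constant (detached) vector `q_values` of $Q_\pi(s,a)$ estimates for Method 1, and a hypothetical helper `q_fn(state, action)` that returns an estimate of $Q_\pi(s,\hat{a})$ for Method 2:

```python
import torch
from torch.distributions import Categorical

def policy_gradient_exact(policy_net, state, q_values):
    """Method 1 (discrete actions): sum_a dpi(a|s;theta)/dtheta * Q_pi(s,a)."""
    probs = policy_net(state)                       # pi(.|s; theta)
    surrogate = (probs * q_values.detach()).sum()   # Q_pi(s,a) treated as a constant
    # With q_values constant, the gradient of the surrogate w.r.t. theta equals
    # sum_a dpi(a|s;theta)/dtheta * Q_pi(s,a).
    return torch.autograd.grad(surrogate, list(policy_net.parameters()))

def policy_gradient_sampled(policy_net, state, q_fn):
    """Method 2 (Monte Carlo): g(a_hat, theta) = dlog pi(a_hat|s;theta)/dtheta * Q_pi(s, a_hat)."""
    probs = policy_net(state)
    dist = Categorical(probs=probs)
    a_hat = dist.sample()                           # a_hat ~ pi(.|s; theta)
    log_prob = dist.log_prob(a_hat)                 # log pi(a_hat|s; theta)
    q_hat = float(q_fn(state, a_hat))               # Q_pi(s, a_hat), treated as a constant
    grads = torch.autograd.grad(log_prob, list(policy_net.parameters()))
    return [q_hat * g for g in grads]               # unbiased estimate of dV(s;theta)/dtheta
```

Both helpers return one gradient tensor per parameter tensor, which could be applied with a manual ascent step like `gradient_ascent_step` above.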
4. Update policy network using policy gradient
- Observe the state $s_t$
- Randomly sample an action $a_t$ according to $\pi(\cdot\mid s_t;\theta_t)$
- Compute $q_t\approx Q_\pi(s_t,a_t)$
- Differentiate the policy network: $d_{\theta,t}=\frac{\partial\log\pi(a_t\mid s_t;\theta)}{\partial\theta}\Big|_{\theta=\theta_t}$
- (Approximate) policy gradient: $g(a_t,\theta_t)=q_t\cdot d_{\theta,t}$
- Update the policy network: $\theta\leftarrow\theta+\beta\cdot g(a_t,\theta_t)$ (a code sketch of this loop follows the list)
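A minimal sketch of one such update, assuming the `PolicyNetwork` above and a hypothetical `q_estimate(state, action)` standing in for whatever approximation of $Q_\pi(s_t,a_t)$ is used (see point 5 below). Instead of applying $g(a_t,\theta_t)$ by hand, the sketch minimizes a surrogate loss whose negative gradient is exactly $g(a_t,\theta_t)$:

```python
import torch
from torch.distributions import Categorical

def policy_update_step(policy_net, optimizer, state, q_estimate):
    """One policy-gradient update: sample a_t, estimate q_t, ascend on q_t * log pi(a_t|s_t;theta)."""
    probs = policy_net(state)                  # 1-2. observe s_t, sample a_t ~ pi(.|s_t; theta_t)
    dist = Categorical(probs=probs)
    action = dist.sample()
    q_t = float(q_estimate(state, action))     # 3. q_t ~= Q_pi(s_t, a_t), treated as a constant
    loss = -q_t * dist.log_prob(action)        # 4-5. gradient of -loss is g(a_t, theta_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # 6. descent on loss = ascent on V(s_t; theta)
    return action, q_t
```

With `optimizer = torch.optim.SGD(policy_net.parameters(), lr=beta)`, minimizing this surrogate loss reproduces the plain $\theta\leftarrow\theta+\beta\cdot g(a_t,\theta_t)$ update.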
5. Computing $q_t\approx Q_\pi(s_t,a_t)$
- Method 1: REINFORCE
  Play one complete episode to obtain the trajectory
  $$s_1,a_1,r_1,\cdots,s_T,a_T,r_T$$
  and compute the return $u_t=\sum_{k=t}^{T}\gamma^{k-t}r_k$. Since $Q_\pi(s_t,a_t)=E[U_t]$, $u_t$ can be used to approximate $Q_\pi(s_t,a_t)$, i.e. $q_t=u_t$ (the return computation is sketched after this list).
- Method 2: approximate $Q_\pi$ with a neural network (the actor-critic method).
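A minimal sketch of the return computation for Method 1, assuming `rewards` is the list $r_1,\dots,r_T$ collected from one finished episode:

```python
def reinforce_returns(rewards, gamma=0.99):
    """u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards via u_t = r_t + gamma * u_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running           # used as q_t in place of Q_pi(s_t, a_t)
    return returns
```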