Reinforcement Learning Notes 3: Policy Learning (Policy-Based Reinforcement Learning)

Overview

1.Policy Function Approximation

  • Policy Network $\pi(a|s;\theta)$

    Use $\pi(a|s;\theta)$ to approximate the policy function $\pi(a|s)$.

    A Softmax output layer guarantees $\sum_{a\in\mathcal{A}}\pi(a|s;\theta)=1$ (see the sketch below).
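
A minimal sketch of such a policy network (PyTorch; the names and dimensions `state_dim`, `action_dim`, `hidden_dim` are hypothetical, not from the notes). The Softmax output layer enforces the normalization above.

```python
# Minimal sketch of a policy network with a Softmax output layer.
# state_dim / action_dim / hidden_dim are hypothetical, chosen for illustration.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        # Softmax makes the outputs non-negative and sum to 1 over the action set.
        return torch.softmax(logits, dim=-1)
```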

2.State-Value Function Approximation

Action-Value function: $Q_\pi(s_t,a_t)=E[U_t\mid S_t=s_t,A_t=a_t]$

State-Value function: $V_\pi(s_t)=E_A[Q_\pi(s_t,A)]$

  • Policy-Based Reinforcement Learning
    $V_\pi(s_t)=E_A[Q_\pi(s_t,A)]=\sum_a\pi(a|s_t)\cdot Q_\pi(s_t,a)$
    After the policy function $\pi(a_t|s_t)$ is approximated by the policy network, the state-value function can be approximated as
    $V_\pi(s_t;\theta)=\sum_a\pi(a|s_t;\theta)\cdot Q_\pi(s_t,a)$
    Learning objective: improve $\theta$ so that $V_\pi(s;\theta)$ becomes larger; the objective function can be defined as
    $\text{maximize}\quad J(\theta)=E_S[V(S;\theta)]$
    Parameter update: policy gradient ascent (a sketch follows this list)

    • Observe state $s$
    • Update the parameters: $\theta\leftarrow\theta+\beta\cdot\frac{\partial V(s;\theta)}{\partial\theta}$
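
As a minimal sketch (assuming the hypothetical `PolicyNetwork` above and a placeholder `q_values` tensor standing in for $Q_\pi(s,a)$, which is normally unknown), one gradient-ascent step on $V(s;\theta)$ could look like this:

```python
# Approximate V(s; theta) = sum_a pi(a|s;theta) * Q_pi(s,a) and take one
# gradient-ascent step on theta. q_values is a placeholder for Q_pi(s, .).
import torch

policy = PolicyNetwork(state_dim=4, action_dim=2)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)   # lr plays the role of beta

state = torch.randn(4)                 # observed state s
q_values = torch.tensor([1.0, 0.5])    # placeholder action values Q_pi(s, a)

v = (policy(state) * q_values).sum()   # V(s; theta)
optimizer.zero_grad()
(-v).backward()                        # minimizing -V is gradient ascent on V
optimizer.step()                       # theta <- theta + beta * dV/dtheta
```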

3.Policy Gradient

$V(s;\theta)=\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)$

$$
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial\theta}
&=\frac{\partial\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)}{\partial\theta}\\
&=\sum_a\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)\\
&=\sum_a\pi(a|s;\theta)\,\frac{\partial\log(\pi(a|s;\theta))}{\partial\theta}\cdot Q_\pi(s,a)\\
&=E_A\left[\frac{\partial\log(\pi(a|s;\theta))}{\partial\theta}\cdot Q_\pi(s,a)\right]
\end{aligned}
$$

(The derivation above is not rigorous: it treats $Q_\pi(s,a)$ as independent of $\theta$, but since $\pi$ depends on $\theta$, that assumption does not actually hold. The result, however, is the same whether or not this dependence is taken into account.)

This gives two ways to compute the policy gradient:

  • Method 1:
    $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_a\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$
    For discrete actions, compute $f(a,\theta)=\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$ for every action and sum the results.

  • Method 2:
    $\frac{\partial V(s;\theta)}{\partial\theta}=E_A\left[\frac{\partial\log(\pi(a|s;\theta))}{\partial\theta}\cdot Q_\pi(s,a)\right]$
    This form works for both continuous and discrete actions. Evaluating the expectation requires an integral, but since $\pi$ is given by a neural network the integral cannot be computed directly, so it is estimated by Monte Carlo approximation (see the sketch after this list):

    • Randomly sample an action $\hat{a}$ from the action space according to the current policy $\pi(\cdot|s;\theta)$
    • Compute $g(\hat{a},\theta)=\frac{\partial\log(\pi(\hat{a}|s;\theta))}{\partial\theta}\cdot Q_\pi(s,\hat{a})$
    • $g(\hat{a},\theta)$ is an unbiased estimate of $\frac{\partial V(s;\theta)}{\partial\theta}$, so it is used as an approximation of the policy gradient
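
A minimal sketch (again assuming the hypothetical `PolicyNetwork` and a placeholder `q_values` tensor) comparing the two methods for a small discrete action space: Method 1 sums over all actions exactly, Method 2 averages Monte Carlo samples of $g(\hat{a},\theta)$; with enough samples the two estimates roughly agree.

```python
# Method 1 (exact sum) vs. Method 2 (Monte Carlo with the log-derivative trick)
# for a small discrete action space. q_values is a placeholder for Q_pi(s, .).
import torch

policy = PolicyNetwork(state_dim=4, action_dim=2)
state = torch.randn(4)
q_values = torch.tensor([1.0, 0.5])

# Method 1: dV/dtheta = sum_a dpi(a|s;theta)/dtheta * Q_pi(s, a)
v = (policy(state) * q_values).sum()
grad_exact = torch.autograd.grad(v, list(policy.parameters()))

# Method 2: average g(a_hat, theta) over sampled actions a_hat ~ pi(.|s; theta)
n_samples = 1000
grad_mc = [torch.zeros_like(p) for p in policy.parameters()]
for _ in range(n_samples):
    probs = policy(state)
    a_hat = torch.distributions.Categorical(probs=probs).sample()
    log_prob = torch.log(probs[a_hat])
    g = torch.autograd.grad(log_prob, list(policy.parameters()))
    for acc, gi in zip(grad_mc, g):
        acc += gi * q_values[a_hat] / n_samples
# grad_mc is an unbiased Monte Carlo estimate of grad_exact
```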

4.Update policy network using policy gradient

  • Observe the state $s_t$
  • Randomly sample an action $a_t$ according to $\pi(\cdot|s_t;\theta_t)$
  • Compute $q_t\approx Q_\pi(s_t,a_t)$
  • Differentiate the policy network: $d_{\theta,t}=\frac{\partial\log(\pi(a_t|s_t;\theta))}{\partial\theta}\Big|_{\theta=\theta_t}$
  • (Approximate) policy gradient: $g(a_t,\theta_t)=q_t\cdot d_{\theta,t}$
  • Update the policy network: $\theta\leftarrow\theta+\beta\cdot g(a_t,\theta_t)$
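
A minimal sketch of one such update step (assuming the hypothetical `PolicyNetwork` above; `q_t` is a placeholder scalar, obtained for example as the return $u_t$ described in the next section):

```python
# One policy-gradient update: sample a_t ~ pi(.|s_t; theta_t), then move theta
# along q_t * d log pi(a_t|s_t; theta)/d theta. q_t is assumed to be given.
import torch

def policy_gradient_step(policy, optimizer, state, q_t):
    probs = policy(state)                               # pi(.|s_t; theta_t)
    dist = torch.distributions.Categorical(probs=probs)
    a_t = dist.sample()                                 # randomly sampled action
    loss = -dist.log_prob(a_t) * q_t                    # gradient of -loss is g(a_t, theta_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()    # with SGD: theta <- theta + beta * g(a_t, theta_t)
    return a_t.item()
```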

5. Computing $q_t\approx Q_\pi(s_t,a_t)$

  • Method 1: REINFORCE

    Play one complete episode to obtain the trajectory
    $s_1,a_1,r_1,\cdots,s_T,a_T,r_T$
    Compute the return $u_t=\sum_{k=t}^T\gamma^{k-t}r_k$. Since $Q_\pi(s_t,a_t)=E[U_t]$, $u_t$ can be used to approximate $Q_\pi(s_t,a_t)$ (see the sketch after this list), i.e.
    $q_t=u_t$

  • Method 2: approximate $Q_\pi$ with a neural network (the actor-critic method)
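
A minimal sketch of the REINFORCE return computation in Method 1 (the reward list and discount factor below are hypothetical examples):

```python
# Compute discounted returns u_t = sum_{k=t}^{T} gamma^(k-t) * r_k for one
# finished episode; REINFORCE then uses q_t = u_t.
def compute_returns(rewards, gamma=0.99):
    returns = []
    u = 0.0
    for r in reversed(rewards):   # walk backwards: u_t = r_t + gamma * u_{t+1}
        u = r + gamma * u
        returns.append(u)
    returns.reverse()
    return returns                # [u_1, ..., u_T]

# Example episode with rewards r_1, r_2, r_3
print(compute_returns([1.0, 0.0, 2.0], gamma=0.9))   # approx [2.62, 1.8, 2.0]
```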
