Overview
BCQ
- Paper walkthrough
- 1. Main problems
- 2. Methods and framework
- 2.1 Policy constraint
- 2.1.1 Addressing extrapolation error
- 2.2 The BCQ algorithm
- 3. Code
- 3.1 main
- 3.1.1 interact_with_environment
- 3.1.2 train_BCQ
- 3.1.3 main
- 3.2 DDPG
- 3.2.1 Actor
- 3.2.2 Critic
- 3.2.3 DDPG
- 3.3 BCQ
- 3.3.1 Actor
- 3.3.2 Critic
- 3.3.3 VAE
- 3.3.4 BCQ
- 3.4 utils
- 3.4.1 ReplayBuffer
- 4. Summary
- 4.1 Data collection method
- 4.2 The batch-constraint idea
- 4.3 Improved Q-value estimation
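Paper walkthrough
1. Main problems
- Absent data: offline RL learns only from a fixed dataset, and that dataset cannot contain every state-action pair.
- Distribution shift: the main cause is the mismatch between the learned policy and the behavior policy. Unlike online RL, offline RL cannot correct this mismatch by interacting with the environment.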
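2. Methods and framework
2.1 Policy constraint
If the selected pairs (s, a) can be kept as similar as possible to the data in dataset D, the problems above can be addressed.
The paper proposes the following objectives:
(1) Minimize the distance between the selected action and the actions present in the dataset.
(2) Lead to states that are similar to the states in the dataset.
(3) Maximize the value function.
The first objective is the most important: only when it is satisfied can the second and third be estimated accurately.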
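2.1.1 Addressing extrapolation error
First, the paper defines the transition probability p_B of the MDP M_B that is induced by the batch: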
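For reference, the definition in the BCQ paper can be written as below. This is a reconstruction in the paper's notation, assuming N(s, a, s') denotes the number of times the transition (s, a, s') appears in the batch B.

```latex
p_{\mathcal{B}}(s' \mid s, a) = \frac{N(s, a, s')}{\sum_{\tilde{s}} N(s, a, \tilde{s})}
```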
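Next, an error function is defined to explain where these errors come from (Q^π is the true value function under the real MDP M, and Q^π_B is the value function computed from the dataset, that is, under M_B):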
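Written out in the paper's notation (a reconstruction, with the policy π made explicit as a superscript), the tabular extrapolation error is the difference between the two value functions:

```latex
\epsilon_{\text{MDP}}(s, a) = Q^{\pi}(s, a) - Q^{\pi}_{\mathcal{B}}(s, a)
```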
Expanding both value functions with the Bellman equation, this can be rewritten in the following form:
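A sketch of that form, obtained by substituting the Bellman equations for Q^π (under p_M) and Q^π_B (under p_B) and writing Q^π = Q^π_B + ε_MDP. The symbols r, γ and π(a'|s') are the usual reward, discount factor and policy probabilities, assumed here rather than quoted from this note.

```latex
\epsilon_{\text{MDP}}(s, a) = \sum_{s'} \Big[
  \big( p_M(s' \mid s, a) - p_{\mathcal{B}}(s' \mid s, a) \big)
  \Big( r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}_{\mathcal{B}}(s', a') \Big)
  + p_M(s' \mid s, a)\, \gamma \sum_{a'} \pi(a' \mid s')\, \epsilon_{\text{MDP}}(s', a')
\Big]
```

The first term vanishes when the two transition models agree, which leaves a recursion whose only solution is ε_MDP = 0.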
This form shows that when p_M and p_B assign the same probabilities, ε_MDP = 0, and the policy-level error below is then 0 as well:
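The policy-level error referred to here can be written as follows (a reconstruction, assuming μ_π(s) denotes the state visitation distribution of π):

```latex
\epsilon^{\pi}_{\text{MDP}} = \sum_{s} \mu_{\pi}(s) \sum_{a} \pi(a \mid s)\, \big| \epsilon_{\text{MDP}}(s, a) \big|
```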
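Given this result, when ε_MDP = 0 and the initial state s_0 is present in the batch, the batch constraint can be combined with Q-learning (BCQL), which gives the following update: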
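In the paper's notation the BCQL update is ordinary Q-learning with the maximization restricted to actions a' for which (s', a') appears in the batch (reconstructed here, with α the learning rate):

```latex
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \Big( r + \gamma \max_{a' \,:\, (s', a') \in \mathcal{B}} Q(s', a') \Big)
```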
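From this update the paper obtains two theorems:
(1) With learning rate α (under the usual stochastic-approximation conditions) and the standard sampling requirements on the environment, BCQL converges to the optimal action-value function Q*.
(2) Given a deterministic MDP and a coherent batch B, with learning rate α, BCQL converges to Q^π_B(s, a), where π*(s) = argmax over {a : (s, a) ∈ B} of Q^π_B(s, a) is the optimal batch-constrained policy.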
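A minimal tabular sketch of BCQL, assuming a hypothetical batch format of (s, a, r, s_next, done) tuples; the function name `bcql` and its defaults are illustrative and not from the paper or its code. The only change from ordinary Q-learning is that the max ranges over actions the batch contains for the next state. Using a bootstrap of 0 when the next state has no recorded actions is a simplification.

```python
from collections import defaultdict

def bcql(batch, gamma=0.99, alpha=0.5, sweeps=200):
    """Tabular batch-constrained Q-learning over a fixed list of transitions."""
    Q = defaultdict(float)
    # Actions that the batch actually contains for each state.
    seen = defaultdict(set)
    for s, a, _, _, _ in batch:
        seen[s].add(a)
    for _ in range(sweeps):
        for s, a, r, s_next, done in batch:
            # The max only considers actions a' with (s_next, a') in the batch.
            candidates = [Q[(s_next, a2)] for a2 in seen[s_next]]
            bootstrap = 0.0 if done or not candidates else max(candidates)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * bootstrap)
    return Q

# Toy deterministic chain: 0 -> 1 -> terminal, reward 1 on the last step.
batch = [(0, "right", 0.0, 1, False), (1, "right", 1.0, 2, True)]
Q = bcql(batch)
```

On this coherent toy batch the values settle at Q(1, right) = 1 and Q(0, right) = γ, and no action outside the batch is ever evaluated.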
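2.2 The BCQ algorithm
BCQ extends BCQL to continuous environments. To satisfy the batch constraint, BCQ uses a generative model, a VAE. For a given state, the generative model produces a set of actions similar to those in the batch, and the Q-network selects the one with the highest value. The value estimate additionally penalizes rare future states, in a way similar to Clipped Double Q-learning. As a result, BCQ learns a policy whose state-action visitation is similar to that of the dataset.
To reduce extrapolation error, the similarity between a given pair (s, a) and the state-action pairs in dataset D is modeled as a conditional probability P(a|s). Estimating this probability directly is difficult, so a parametric generative model G_ω(s) is trained to approximate it. Here G_ω is a VAE, and it is used together with the value network Q_θ to pick the sampled action with the highest estimated value. A perturbation model ξ_φ(s, a, Φ) is also added to increase the diversity of candidate actions; its output is restricted to the range [-Φ, Φ]. The resulting policy is: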
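In the paper's notation the policy can be written as below. This is a reconstruction, assuming G_ω is the generative model, ξ_φ the perturbation model with range Φ, Q_θ the value network, and n the number of sampled actions.

```latex
\pi(s) = \operatorname*{argmax}_{a_i + \xi_\phi(s, a_i, \Phi)}
  Q_\theta\big(s,\, a_i + \xi_\phi(s, a_i, \Phi)\big),
  \qquad \{ a_i \sim G_\omega(s) \}_{i=1}^{n}
```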
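In this expression, n and Φ determine whether the algorithm behaves like imitation learning or like reinforcement learning. With n = 1 and Φ = 0 it is imitation learning, reproducing the policy in dataset D one-to-one; as Φ → a_max - a_min and n → ∞, BCQ approaches Q-learning. The perturbation model is trained with an objective similar to DDPG's: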
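That objective is the deterministic policy gradient applied to the perturbed action. A reconstruction in the paper's notation, with the same assumed symbols ξ_φ, Φ and Q_θ:

```latex
\phi \leftarrow \operatorname*{argmax}_{\phi} \sum_{(s, a) \in \mathcal{B}}
  Q_\theta\big(s,\, a + \xi_\phi(s, a, \Phi)\big)
```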
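Finally, BCQ estimates action values with Clipped Double Q-learning: two action-value networks are trained and the minimum of the two is used as the estimate. BCQ modifies Clipped Double Q-learning by combining the two values in a new way: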
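The combined target can be written as follows (a reconstruction, assuming θ'_1 and θ'_2 are the two target critics, a_i the perturbed candidate actions for s', and λ the weight called `lmbda` in the code):

```latex
y = r + \gamma \max_{a_i} \Big[
  \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i)
  + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \Big]
```

With λ = 1 this reduces to plain Clipped Double Q-learning; the code's default of 0.75 keeps most of the weight on the minimum.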
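The resulting pseudocode is as follows:
The VAE used inside the algorithm works as follows:
The VAE G_ω is defined by two networks, an encoder E_ω1(s, a) and a decoder D_ω2(s, z), where ω = {ω1, ω2}. The encoder takes a state-action pair and outputs the mean µ and standard deviation σ of a Gaussian distribution N(µ, σ). The state s, together with a latent vector z sampled from that Gaussian, is passed to the decoder D_ω2(s, z), which outputs an action. Each network follows the default architecture (Figure 10), except that its two hidden layers have size 750 instead of 400 and 300. The VAE is trained on the mean squared reconstruction error plus a KL regularization term: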
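The training loss can be written as below (a reconstruction, where ã is assumed to denote the action reconstructed by the decoder):

```latex
\mathcal{L}_{\text{VAE}} = \sum_{(s, a) \in \mathcal{B}} \big( a - \tilde{a} \big)^2
  + D_{\text{KL}}\big( \mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, 1) \big),
  \qquad \tilde{a} = D_{\omega_2}(s, z),\; z \sim \mathcal{N}(\mu, \sigma)
```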
Because both distributions are Gaussian, the KL divergence term simplifies to:
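For a diagonal Gaussian against a standard normal, the closed form is the standard one below, with i indexing the latent dimensions:

```latex
D_{\text{KL}}\big( \mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, 1) \big)
  = -\frac{1}{2} \sum_{i} \big( 1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2 \big)
```

The `KL_loss` line in section 3.3.4 has this shape, except that it takes a mean instead of a sum and the result is weighted by 0.5 in `vae_loss`.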
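3. Code
The code consists of four parts: main, DDPG, BCQ, and utils.
3.1 main
main has three parts.
Part 1: use DDPG to generate the 1,000,000 transitions we need.
Part 2: train BCQ.
Part 3: load the generated dataset for BCQ training and set the relevant parameters.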
3.1.1 interact_with_environment
# Interacts with the environment to generate the required data.
# Imports used throughout main; DDPG, BCQ and utils are the modules in sections 3.2 to 3.4.
import argparse
import os

import gym
import numpy as np
import torch

import BCQ
import DDPG
import utils


# Note: eval_policy(policy, env_name, seed) is called below but is not listed in this post.
def interact_with_environment(env, state_dim, action_dim, max_action, device, args):
    # For saving files
    setting = f"{args.env}_{args.seed}"
    buffer_name = f"{args.buffer_name}_{setting}"

    # Initialize and load policy
    policy = DDPG.DDPG(state_dim, action_dim, max_action, device)  # , args.discount, args.tau)
    if args.generate_buffer:
        policy.load(f"./models/behavioral_{setting}")

    # Initialize buffer
    replay_buffer = utils.ReplayBuffer(state_dim, action_dim, device)

    evaluations = []

    state, done = env.reset(), False
    episode_reward = 0
    episode_timesteps = 0
    episode_num = 0

    # Interact with the environment for max_timesteps
    for t in range(int(args.max_timesteps)):
        episode_timesteps += 1

        # Select action with noise
        if (
            (args.generate_buffer and np.random.uniform(0, 1) < args.rand_action_p) or
            (args.train_behavioral and t < args.start_timesteps)
        ):
            action = env.action_space.sample()
        else:
            action = (
                policy.select_action(np.array(state))
                + np.random.normal(0, max_action * args.gaussian_std, size=action_dim)
            ).clip(-max_action, max_action)

        # Perform action
        next_state, reward, done, _ = env.step(action)
        # An episode cut off by the time limit is not stored as a terminal state
        done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

        # Store data in replay buffer
        replay_buffer.add(state, action, next_state, reward, done_bool)

        state = next_state
        episode_reward += reward

        # Train agent after collecting sufficient data
        if args.train_behavioral and t >= args.start_timesteps:
            policy.train(replay_buffer, args.batch_size)

        if done:
            # +1 to account for 0 indexing. +0 on ep_timesteps since it will increment +1 even if done=True
            print(f"Total T: {t+1} Episode Num: {episode_num+1} Episode T: {episode_timesteps} Reward: {episode_reward:.3f}")
            # Reset environment
            state, done = env.reset(), False
            episode_reward = 0
            episode_timesteps = 0
            episode_num += 1

        # Evaluate episode
        if args.train_behavioral and (t + 1) % args.eval_freq == 0:
            evaluations.append(eval_policy(policy, args.env, args.seed))
            np.save(f"./results/behavioral_{setting}", evaluations)
            policy.save(f"./models/behavioral_{setting}")

    # Save final policy
    if args.train_behavioral:
        policy.save(f"./models/behavioral_{setting}")
    # Save final buffer and performance
    else:
        evaluations.append(eval_policy(policy, args.env, args.seed))
        np.save(f"./results/buffer_performance_{setting}", evaluations)
        replay_buffer.save(f"./buffers/{buffer_name}")
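The `done_bool` line in the listing is easy to overlook, so here is a small sketch of what it does and why. The helper names `done_flag` and `td_target` are hypothetical and written only for this illustration. An episode that ends because the step limit was reached is stored as non-terminal, so the critic still bootstraps from the next state.

```python
def done_flag(done, episode_timesteps, max_episode_steps):
    """Return 1.0 only for true terminal states, not for time-limit cut-offs."""
    return float(done) if episode_timesteps < max_episode_steps else 0.0

def td_target(reward, next_q, done_bool, discount=0.99):
    # not_done = 1 - done_bool keeps the bootstrap term for time-limit cut-offs.
    return reward + (1.0 - done_bool) * discount * next_q

fell_over = done_flag(True, 57, 1000)     # the agent really terminated
timed_out = done_flag(True, 1000, 1000)   # the episode only hit the step limit
```

With these flags, `td_target(1.0, 10.0, fell_over)` drops the bootstrap term, while `td_target(1.0, 10.0, timed_out)` keeps it.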
3.1.2 train_BCQ
# Trains BCQ on a previously saved buffer.
def train_BCQ(state_dim, action_dim, max_action, device, args):
    # For saving files
    setting = f"{args.env}_{args.seed}"
    buffer_name = f"{args.buffer_name}_{setting}"

    # Initialize policy
    policy = BCQ.BCQ(state_dim, action_dim, max_action, device, args.discount, args.tau, args.lmbda, args.phi)

    # Load buffer
    replay_buffer = utils.ReplayBuffer(state_dim, action_dim, device)
    replay_buffer.load(f"./buffers/{buffer_name}")

    evaluations = []
    episode_num = 0
    done = True
    training_iters = 0

    # Alternate between eval_freq training iterations and one evaluation
    while training_iters < args.max_timesteps:
        pol_vals = policy.train(replay_buffer, iterations=int(args.eval_freq), batch_size=args.batch_size)

        evaluations.append(eval_policy(policy, args.env, args.seed))
        np.save(f"./results/BCQ_{setting}", evaluations)

        training_iters += args.eval_freq
        print(f"Training iterations: {training_iters}")
3.1.3 main
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="Hopper-v3")                # OpenAI gym environment name
    parser.add_argument("--seed", default=0, type=int)               # Seed for Gym, PyTorch and NumPy
    parser.add_argument("--buffer_name", default="Robust")           # Name used for the saved buffer files
    parser.add_argument("--eval_freq", default=5e3, type=float)      # How often (in time steps) the policy is evaluated
    parser.add_argument("--max_timesteps", default=1e6, type=int)    # Max time steps to run or train for (also the buffer size)
    parser.add_argument("--start_timesteps", default=25e3, type=int) # Initial time steps with random actions before the behavioral policy is trained
    parser.add_argument("--rand_action_p", default=0.3, type=float)  # Probability of a random action while generating the buffer
    parser.add_argument("--gaussian_std", default=0.3, type=float)   # Std of the Gaussian exploration noise
    parser.add_argument("--batch_size", default=100, type=int)       # Mini-batch size sampled from the dataset
    parser.add_argument("--discount", default=0.99, type=float)      # Reward discount factor
    parser.add_argument("--tau", default=0.005, type=float)          # Target network update rate
    parser.add_argument("--lmbda", default=0.75, type=float)         # Weighting for clipped double Q-learning in BCQ
    parser.add_argument("--phi", default=0.05, type=float)           # Max perturbation hyper-parameter for BCQ
    parser.add_argument("--train_behavioral", action="store_true")   # If true, train behavioral (DDPG)
    parser.add_argument("--generate_buffer", action="store_true")    # If true, generate buffer
    args = parser.parse_args()

    print("---------------------------------------")
    if args.train_behavioral:
        print(f"Setting: Training behavioral, Env: {args.env}, Seed: {args.seed}")
    elif args.generate_buffer:
        print(f"Setting: Generating buffer, Env: {args.env}, Seed: {args.seed}")
    else:
        print(f"Setting: Training BCQ, Env: {args.env}, Seed: {args.seed}")
    print("---------------------------------------")

    if args.train_behavioral and args.generate_buffer:
        print("Train_behavioral and generate_buffer cannot both be true.")
        exit()

    # Create the output directories if they do not exist yet
    if not os.path.exists("./results"):
        os.makedirs("./results")
    if not os.path.exists("./models"):
        os.makedirs("./models")
    if not os.path.exists("./buffers"):
        os.makedirs("./buffers")

    env = gym.make(args.env)

    # Seed everything for reproducibility
    env.seed(args.seed)
    env.action_space.seed(args.seed)
    torch.manual_seed(args.seed)
    np.random.seed(args.seed)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if args.train_behavioral or args.generate_buffer:
        interact_with_environment(env, state_dim, action_dim, max_action, device, args)
    else:
        train_BCQ(state_dim, action_dim, max_action, device, args)
3.2 DDPG
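Part 1: build the Actor network, which selects action a.
Part 2: build the Critic network, which evaluates the value function Q.
Part 3: train with the Actor and Critic and update their parameters.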
3.2.1 Actor
import copy

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action

    def forward(self, state):
        a = F.relu(self.l1(state))
        a = F.relu(self.l2(a))
        # tanh bounds the output to [-1, 1] before scaling to the action range
        return self.max_action * torch.tanh(self.l3(a))
3.2.2 Critic
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, state, action):
        q = F.relu(self.l1(torch.cat([state, action], 1)))
        q = F.relu(self.l2(q))
        return self.l3(q)
3.2.3 DDPG
class DDPG(object):
    # Initialize networks, optimizers and hyper-parameters
    def __init__(self, state_dim, action_dim, max_action, device, discount=0.99, tau=0.005):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters())

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters())

        self.discount = discount
        self.tau = tau
        self.device = device

    # Select an action for a single state
    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(self.device)
        return self.actor(state).cpu().data.numpy().flatten()

    # One training step
    def train(self, replay_buffer, batch_size=100):
        # Sample replay buffer
        state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

        # Compute the target Q value
        target_Q = self.critic_target(next_state, self.actor_target(next_state))
        target_Q = reward + (not_done * self.discount * target_Q).detach()

        # Get the current Q estimate
        current_Q = self.critic(state, action)

        # Compute critic loss
        critic_loss = F.mse_loss(current_Q, target_Q)

        # Optimize the critic
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Compute actor loss
        actor_loss = -self.critic(state, self.actor(state)).mean()

        # Optimize the actor
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft-update the target networks
        for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

        for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

    # Save model and optimizer parameters
    def save(self, filename):
        torch.save(self.critic.state_dict(), filename + "_critic")
        torch.save(self.critic_optimizer.state_dict(), filename + "_critic_optimizer")

        torch.save(self.actor.state_dict(), filename + "_actor")
        torch.save(self.actor_optimizer.state_dict(), filename + "_actor_optimizer")

    # Load model and optimizer parameters
    def load(self, filename):
        self.critic.load_state_dict(torch.load(filename + "_critic"))
        self.critic_optimizer.load_state_dict(torch.load(filename + "_critic_optimizer"))
        self.critic_target = copy.deepcopy(self.critic)

        self.actor.load_state_dict(torch.load(filename + "_actor"))
        self.actor_optimizer.load_state_dict(torch.load(filename + "_actor_optimizer"))
        self.actor_target = copy.deepcopy(self.actor)
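The two loops at the end of `train` perform a soft (Polyak) update of the target networks. The sketch below shows the same arithmetic on plain Python floats; the function name `soft_update` is hypothetical and lists of numbers stand in for network parameters.

```python
def soft_update(params, target_params, tau=0.005):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]

online = [1.0, -2.0]
target = [0.0, 0.0]
one_step = soft_update(online, target)

# Repeating the update makes the target trail the online values slowly.
for _ in range(1000):
    target = soft_update(online, target)
```

With `tau=0.005` a single step moves the target only 0.5% of the way, which is what keeps the bootstrapped Q target stable during training.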
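3.3 BCQ
BCQ has four parts:
Part 1: build the Actor network, which here acts as the perturbation model and outputs the adjusted action a.
Part 2: build the Critic network, which evaluates the value function Q.
Part 3: build the VAE network, which models the behavior policy in dataset D and generates candidate actions.
Part 4: training.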
3.3.1 Actor
import copy

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action, phi=0.05):
        super(Actor, self).__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action
        self.phi = phi

    def forward(self, state, action):
        a = F.relu(self.l1(torch.cat([state, action], 1)))
        a = F.relu(self.l2(a))
        # Perturbation bounded by phi * max_action
        a = self.phi * self.max_action * torch.tanh(self.l3(a))
        # Add the perturbation to the input action and clip to the valid range
        return (a + action).clamp(-self.max_action, self.max_action)
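To make the bound on the perturbation concrete, here is a scalar sketch of the last two lines of `forward`. The function name `perturb` is hypothetical, and `raw_output` stands in for the output of the final linear layer.

```python
import math

def perturb(action, raw_output, max_action=1.0, phi=0.05):
    """Shift `action` by at most phi * max_action, then clip to the valid range."""
    shift = phi * max_action * math.tanh(raw_output)
    return max(-max_action, min(max_action, action + shift))

pushed_up = perturb(0.2, 100.0)      # tanh saturates, so the shift is +phi
pushed_down = perturb(0.2, -100.0)   # and here it is -phi
clipped = perturb(0.99, 100.0)       # the sum would exceed max_action
```

However large the network output is, the action generated by the VAE moves by at most `phi * max_action`, which is how the perturbed action stays close to the data.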
3.3.2 Critic
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        # First Q-network
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

        # Second Q-network
        self.l4 = nn.Linear(state_dim + action_dim, 400)
        self.l5 = nn.Linear(400, 300)
        self.l6 = nn.Linear(300, 1)

    def forward(self, state, action):
        q1 = F.relu(self.l1(torch.cat([state, action], 1)))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)

        q2 = F.relu(self.l4(torch.cat([state, action], 1)))
        q2 = F.relu(self.l5(q2))
        q2 = self.l6(q2)
        return q1, q2

    # Output of the first Q-network only (used for action selection and the actor loss)
    def q1(self, state, action):
        q1 = F.relu(self.l1(torch.cat([state, action], 1)))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)
        return q1
3.3.3 VAE
class VAE(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim, max_action, device):
        super(VAE, self).__init__()
        # Encoder layers
        self.e1 = nn.Linear(state_dim + action_dim, 750)
        self.e2 = nn.Linear(750, 750)

        # Mean of the latent Gaussian
        self.mean = nn.Linear(750, latent_dim)
        # Log standard deviation of the latent Gaussian
        self.log_std = nn.Linear(750, latent_dim)

        # Decoder layers
        self.d1 = nn.Linear(state_dim + latent_dim, 750)
        self.d2 = nn.Linear(750, 750)
        self.d3 = nn.Linear(750, action_dim)

        self.max_action = max_action
        self.latent_dim = latent_dim
        self.device = device

    def forward(self, state, action):
        z = F.relu(self.e1(torch.cat([state, action], 1)))
        z = F.relu(self.e2(z))

        mean = self.mean(z)
        # Clamped for numerical stability
        log_std = self.log_std(z).clamp(-4, 15)
        std = torch.exp(log_std)
        # Reparameterization trick: sample z from N(mean, std)
        z = mean + std * torch.randn_like(std)

        # Decode the latent vector into a reconstructed action
        u = self.decode(state, z)

        return u, mean, std

    # The decoder maps (state, latent vector) to an action
    def decode(self, state, z=None):
        # When sampling from the VAE, the latent vector is clipped to [-0.5, 0.5]
        if z is None:
            z = torch.randn((state.shape[0], self.latent_dim)).to(self.device).clamp(-0.5, 0.5)

        a = F.relu(self.d1(torch.cat([state, z], 1)))
        a = F.relu(self.d2(a))
        return self.max_action * torch.tanh(self.d3(a))
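The sampling step in `forward` and the KL term used later in training can be checked on plain Python lists. This is a sketch under the assumption that a list of floats stands in for one latent vector; the helper names `sample_latent` and `kl_to_standard_normal` are hypothetical.

```python
import math
import random

def sample_latent(mean, log_std, rng=random):
    """Reparameterization trick: z = mean + std * eps with eps ~ N(0, 1)."""
    clamped = [min(15.0, max(-4.0, ls)) for ls in log_std]  # same clamp as the listing
    std = [math.exp(ls) for ls in clamped]
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    return z, std

def kl_to_standard_normal(mean, std):
    """Mean over latent dims of -0.5 * (1 + log(std^2) - mean^2 - std^2)."""
    terms = [1.0 + math.log(s * s) - m * m - s * s for m, s in zip(mean, std)]
    return -0.5 * sum(terms) / len(terms)

z, std = sample_latent([0.0], [-10.0])            # log_std below the clamp range
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_to_standard_normal([1.0], [1.0])
```

The KL term is zero exactly when the encoder outputs a standard normal and grows as the mean or standard deviation drifts away from it.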
3.3.4 BCQ
class BCQ(object):
    def __init__(self, state_dim, action_dim, max_action, device, discount=0.99, tau=0.005, lmbda=0.75, phi=0.05):
        latent_dim = action_dim * 2

        self.actor = Actor(state_dim, action_dim, max_action, phi).to(device)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=1e-3)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=1e-3)

        self.vae = VAE(state_dim, action_dim, latent_dim, max_action, device).to(device)
        self.vae_optimizer = torch.optim.Adam(self.vae.parameters())

        self.max_action = max_action
        self.action_dim = action_dim
        self.discount = discount
        self.tau = tau
        self.lmbda = lmbda
        self.device = device

    # Select an action: sample 100 candidates from the VAE, perturb them, keep the one with the highest Q1
    def select_action(self, state):
        with torch.no_grad():
            state = torch.FloatTensor(state.reshape(1, -1)).repeat(100, 1).to(self.device)
            action = self.actor(state, self.vae.decode(state))
            q1 = self.critic.q1(state, action)
            ind = q1.argmax(0)
        return action[ind].cpu().data.numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size=100):
        for it in range(iterations):
            # Sample replay buffer / batch
            state, action, next_state, reward, not_done = replay_buffer.sample(batch_size)

            # VAE training
            recon, mean, std = self.vae(state, action)
            recon_loss = F.mse_loss(recon, action)
            KL_loss = -0.5 * (1 + torch.log(std.pow(2)) - mean.pow(2) - std.pow(2)).mean()
            vae_loss = recon_loss + 0.5 * KL_loss

            self.vae_optimizer.zero_grad()
            vae_loss.backward()
            self.vae_optimizer.step()

            # Critic Training
            with torch.no_grad():
                # Duplicate next state 10 times
                next_state = torch.repeat_interleave(next_state, 10, 0)

                # Compute value of perturbed actions sampled from the VAE
                target_Q1, target_Q2 = self.critic_target(next_state, self.actor_target(next_state, self.vae.decode(next_state)))

                # Soft Clipped Double Q-learning
                target_Q = self.lmbda * torch.min(target_Q1, target_Q2) + (1. - self.lmbda) * torch.max(target_Q1, target_Q2)
                # Take max over each action sampled from the VAE
                target_Q = target_Q.reshape(batch_size, -1).max(1)[0].reshape(-1, 1)

                target_Q = reward + not_done * self.discount * target_Q

            current_Q1, current_Q2 = self.critic(state, action)
            critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Perturbation Model / Action Training
            sampled_actions = self.vae.decode(state)
            perturbed_actions = self.actor(state, sampled_actions)

            # Update through DPG
            actor_loss = -self.critic.q1(state, perturbed_actions).mean()

            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Soft-update the target networks
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
3.4 utils
Mainly used to store and load data; most implementations follow this same template.
3.4.1 ReplayBuffer
import numpy as np
import torch


class ReplayBuffer(object):
    def __init__(self, state_dim, action_dim, device, max_size=int(1e6)):
        self.max_size = max_size
        # Index of the slot that the next transition is written to
        self.ptr = 0
        # Number of transitions currently stored
        self.size = 0

        self.state = np.zeros((max_size, state_dim))
        self.action = np.zeros((max_size, action_dim))
        self.next_state = np.zeros((max_size, state_dim))
        self.reward = np.zeros((max_size, 1))
        self.not_done = np.zeros((max_size, 1))

        self.device = device

    # Add a new transition, overwriting the oldest one once the buffer is full
    def add(self, state, action, next_state, reward, done):
        self.state[self.ptr] = state
        self.action[self.ptr] = action
        self.next_state[self.ptr] = next_state
        self.reward[self.ptr] = reward
        self.not_done[self.ptr] = 1. - done

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    # Sample a random mini-batch
    def sample(self, batch_size):
        # Indices of the sampled transitions
        ind = np.random.randint(0, self.size, size=batch_size)

        return (
            torch.FloatTensor(self.state[ind]).to(self.device),
            torch.FloatTensor(self.action[ind]).to(self.device),
            torch.FloatTensor(self.next_state[ind]).to(self.device),
            torch.FloatTensor(self.reward[ind]).to(self.device),
            torch.FloatTensor(self.not_done[ind]).to(self.device)
        )

    # Save the stored transitions to disk
    def save(self, save_folder):
        np.save(f"{save_folder}_state.npy", self.state[:self.size])
        np.save(f"{save_folder}_action.npy", self.action[:self.size])
        np.save(f"{save_folder}_next_state.npy", self.next_state[:self.size])
        np.save(f"{save_folder}_reward.npy", self.reward[:self.size])
        np.save(f"{save_folder}_not_done.npy", self.not_done[:self.size])
        np.save(f"{save_folder}_ptr.npy", self.ptr)

    # Load previously saved transitions from disk
    def load(self, save_folder, size=-1):
        reward_buffer = np.load(f"{save_folder}_reward.npy")

        # Adjust crt_size if we're using a custom size
        size = min(int(size), self.max_size) if size > 0 else self.max_size
        self.size = min(reward_buffer.shape[0], size)

        self.state[:self.size] = np.load(f"{save_folder}_state.npy")[:self.size]
        self.action[:self.size] = np.load(f"{save_folder}_action.npy")[:self.size]
        self.next_state[:self.size] = np.load(f"{save_folder}_next_state.npy")[:self.size]
        self.reward[:self.size] = reward_buffer[:self.size]
        self.not_done[:self.size] = np.load(f"{save_folder}_not_done.npy")[:self.size]
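The `ptr` and `size` bookkeeping in `add` makes the buffer circular. The sketch below isolates just that index logic; the class name `RingIndex` is hypothetical and it stores no data.

```python
class RingIndex:
    """Tracks the write pointer and fill level of a fixed-size circular buffer."""

    def __init__(self, max_size):
        self.max_size, self.ptr, self.size = max_size, 0, 0

    def advance(self):
        """Return the slot to write to, then move the pointer forward."""
        slot = self.ptr
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)
        return slot

idx = RingIndex(max_size=3)
slots = [idx.advance() for _ in range(5)]
```

After five writes into three slots, the fourth and fifth writes reuse slots 0 and 1, while `size` stays capped at the capacity so that `sample` never draws an index outside the filled region.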
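4. Summary
This section summarizes the methods BCQ uses, considers some extensions and alternatives, and compares them.
4.1 Data collection method
The paper uses DDPG to collect the data, mainly for the following reasons:
- DDPG is an off-policy algorithm that learns from a replay buffer, which matches how an offline RL dataset is sampled. It also serves as a baseline against BCQ, showing why an ordinary off-policy algorithm cannot be applied directly to offline RL.
- Parts of BCQ are built on DDPG and share its structure, for example:
However, BCQ is also very similar to TD3:
For action selection, both follow the idea of DPG plus a perturbation.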
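4.2 The batch-constraint idea
BCQ proposes its constraint to address the out-of-distribution and distribution-shift problems. The main idea is to tie the learned policy to the behavior policy so that the selected state-action pairs stay, as far as possible, inside the known dataset B. To do this it uses a batch constraint, trained through a VAE that models the data plus a perturbation model. This makes BCQ heavily dependent on the quality of the dataset, because the policies it explores are mostly very close to the policy in the dataset.
One possible extension follows from this. BCQ tries to match the distribution of the dataset closely; could we instead only require that the selected state-action pairs exist in the dataset, without forcing their probabilities to match? For example, a divergence could define the allowed range, with a threshold ensuring that the policy does not leave it. This would prevent out-of-distribution actions, but distribution shift could still occur, so how to handle distribution shift remains a point to consider.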
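4.3 Improved Q-value estimation
The Q-value estimate is also modified:
The formula is a modification of Clipped Double Q-learning. It takes a convex combination of the two values while giving a higher weight to the minimum, which reduces overestimation and also reduces the influence of rarely seen states.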
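A small numeric sketch of this convex combination for a single transition, assuming two lists of Q-values (one per critic) for the candidate next actions. The function name `bcq_target` is hypothetical, and the defaults mirror `lmbda=0.75` and `discount=0.99` from the code.

```python
def bcq_target(reward, not_done, q1_candidates, q2_candidates, discount=0.99, lmbda=0.75):
    """Target for one transition given Q-values of the candidate next actions."""
    combined = [
        lmbda * min(q1, q2) + (1.0 - lmbda) * max(q1, q2)
        for q1, q2 in zip(q1_candidates, q2_candidates)
    ]
    # Take the best candidate after the pessimistic combination.
    return reward + not_done * discount * max(combined)

# Two candidate actions; the critics disagree strongly on the second one.
target = bcq_target(1.0, 1.0, q1_candidates=[2.0, 4.0], q2_candidates=[4.0, 0.0])
```

The second candidate has the highest single estimate (4.0), but because the critics disagree about it, its combined value is only 1.0. The first candidate scores 2.5 and sets the target, which is the sense in which uncertain estimates are penalized.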