Overview
DDPG stands for deep deterministic policy gradient, an actor-critic algorithm for continuous action spaces.
Environment
The control goal is to regulate the position of a mass by applying a force input.
env = rlPredefinedEnv("DoubleIntegrator-Continuous")
The mass moves along a single dimension, within the range [-4 m, +4 m];
The observations are the position and velocity of the mass;
Episode termination: the mass moves more than 5 m from the origin (MaxDistance), or the goal threshold of 0.01 is reached (see GoalThreshold in the environment properties below);
The reward at each time step is defined as:

r(t) = -( x(t)' Q x(t) + u(t)' R u(t) )

where:
x is the state vector of the mass (position and velocity);
u is the applied force;
Q is the weight matrix on control performance;
R is the weight on control effort (here R = 0.01).
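To see the actual weight values used by this predefined environment, you can print the corresponding properties (a quick check added here, not part of the original listing):

env.Q   % 2x2 state weight matrix (control performance)
env.R   % scalar weight on the force (control effort), 0.01 in this environment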
This information can also be confirmed by inspecting the properties of env.
env = 

  DoubleIntegratorContinuousAction with properties:

             Gain: 1
               Ts: 0.1000
      MaxDistance: 5
    GoalThreshold: 0.0100
                Q: [2x2 double]
                R: 0.0100
         MaxForce: Inf
            State: [2x1 double]
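Beyond reading the properties, a quick interactive check (a minimal sketch, assuming the standard reset/step methods of MATLAB RL environment objects) confirms the observation layout:

obs0 = reset(env)                       % initial observation [x; dx]
[nextObs,reward,isDone] = step(env,0);  % apply zero force for one time step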
Observation specification:
obsInfo = getObservationInfo(env)
numObservations = obsInfo.Dimension(1)
obsInfo = 

  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "states"
    Description: "x, dx"
      Dimension: [2 1]
       DataType: "double"
As shown, the observation range for both x and dx is -Inf to +Inf. The first row is the position and the second row is the velocity.
Action specification:
actInfo = getActionInfo(env)
numActions = numel(actInfo)
actInfo = 

  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "force"
      Dimension: [1 1]
       DataType: "double"
Fix the random seed for reproducibility:
rng(0)
Creating the DDPG Agent
Critic network
statePath = imageInputLayer([numObservations 1 1],'Normalization','none','Name','state');
actionPath = imageInputLayer([numActions 1 1],'Normalization','none','Name','action');
commonPath = [concatenationLayer(1,2,'Name','concat')
quadraticLayer('Name','quadratic')
fullyConnectedLayer(1,'Name','StateValue','BiasLearnRateFactor',0,'Bias',0)];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'state','concat/in1');
criticNetwork = connectLayers(criticNetwork,'action','concat/in2');
The resulting network structure can be visualized with:
plot(criticNetwork)
criticOpts = rlRepresentationOptions('LearnRate',5e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},criticOpts);
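As an optional sanity check (not in the original post; assuming getValue accepts cell-array observation and action inputs), you can query the untrained critic for the Q-value of a sample state-action pair:

q0 = getValue(critic,{[0;0]},{0})   % Q-value for zero state and zero force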
Actor network
actorNetwork = [
imageInputLayer([numObservations 1 1],'Normalization','none','Name','state')
fullyConnectedLayer(numActions,'Name','action','BiasLearnRateFactor',0,'Bias',0)];
actorOpts = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},actorOpts);
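Similarly, the untrained actor can be queried for its action at a sample state (an optional check, assuming the getAction representation interface):

a0 = getAction(actor,{[0;0]})   % force suggested for the zero state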
Agent
Agent options configuration:
agentOpts = rlDDPGAgentOptions(...
'SampleTime',env.Ts,...
'TargetSmoothFactor',1e-3,...
'ExperienceBufferLength',1e6,...
'DiscountFactor',0.99,...
'MiniBatchSize',32);
agentOpts.NoiseOptions.StandardDeviation = 0.3;
agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-6;
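A side note on the exploration noise (my understanding, not from the original post): the DDPG agent uses Ornstein-Uhlenbeck action noise, and StandardDeviationDecayRate appears to scale the standard deviation by (1 - decay rate) at every sample step, so the decay chosen here is very gradual:

% Back-of-the-envelope check, assuming the std is multiplied by (1 - decay rate) per step
halfLifeSteps = log(0.5)/log(1 - 1e-6)   % roughly 6.9e5 steps for the std to halve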
Assemble the agent (note: this is the standard constructor call):
agent = rlDDPGAgent(actor,critic,agentOpts);
Training the Agent
Training options configuration:
trainOpts = rlTrainingOptions(...
'MaxEpisodes', 5000, ...
'MaxStepsPerEpisode', 200, ...
'Verbose', false, ...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',-66);
Start training:
trainingStats = train(agent,env,trainOpts)
The training progress is displayed in the Episode Manager window during training.
In the official documentation example, training only reaches the stop condition ('StopTrainingValue', -66) at around episode 3430. My run took about 3 hours to reach roughly episode 500, but the overall trend was already close to the final result and the oscillatory phase had passed.
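Once training stops, the learned policy can be checked with a short closed-loop simulation. This is a minimal sketch not included in the original post, using the standard sim and rlSimulationOptions calls:

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
totalReward = sum(experience.Reward)   % cumulative reward of the simulated episode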