This post, shared by 诚心悟空 (a blogger on 靠谱客), is a brief introduction to Q-learning in reinforcement learning. It covers reinforcement learning, Q-learning, Deep-Q-learning, and references.

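Reinforcement learning rose to prominence with AlphaGo. This post briefly introduces one reinforcement learning method, Q-learning. It starts with the simplest form, the Q-table, then introduces the Q-network to handle the case of too many states, and finally works through two examples to build a better understanding of Q-learning.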

  • Reinforcement Learning
  • Q-learning
    • Q-Table
    • Bellman Equation
    • Algorithm
    • Example
  • Deep-Q-learning
    • Experience replay
    • Exploration - Exploitation
    • Algorithm
    • Examples
      • CartPole
      • FrozenLake
  • References

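Reinforcement Learning

Reinforcement learning usually involves two entities, the agent and the environment. They interact as follows: when the environment is in state $s_t$, the agent takes action $a_t$, receives reward $r_t$, and the environment moves to state $s_{t+1}$.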
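The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv` and `run_episode` are hypothetical names invented here, and the toy environment (guess a coin flip for five steps) is an assumption, not something from the original post.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, five steps per episode."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state s_0 (here simply the step counter)

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # r_t depends on the action taken
        self.t += 1
        done = self.t >= 5
        return self.t, reward, done  # s_{t+1}, r_t, episode finished?


def run_episode(env, policy):
    """Run one episode and collect the (s_t, a_t, r_t, s_{t+1}) transitions."""
    transitions = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                       # agent picks a_t in s_t
        next_state, reward, done = env.step(action)  # environment responds
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


episode = run_episode(CoinFlipEnv(), policy=lambda s: 0)
```

Each stored tuple is one pass through the loop: state, action, reward, next state.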

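Reinforcement learning problems usually have the following characteristics:

  • Different actions produce different rewards
  • Rewards are delayed
  • The reward for an action depends on the current state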

Q-learning

Q-Table

The core of Q-learning is the Q-table. Its rows and columns correspond to states and actions, respectively, and each entry $Q(s,a)$ measures how good it is to take action $a$ in state $s$.
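As a minimal sketch, a Q-table can be held in a NumPy array with one row per state and one column per action. The sizes and the values written into the table below are made up purely for illustration.

```python
import numpy as np

n_states, n_actions = 4, 2            # hypothetical problem size
Q = np.zeros((n_states, n_actions))   # rows: states, columns: actions

# Illustrative values only: in state 2, action 1 looks better than action 0.
Q[2, 0] = 0.1
Q[2, 1] = 0.7

best_action = int(np.argmax(Q[2]))    # greedy action in state 2
best_value = float(np.max(Q[2]))      # value of that action
```

Reading a row gives the estimated value of every action in that state, so `np.argmax` over the row yields the greedy action.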

Bellman Equation

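During training, we use the Bellman equation to update the Q-table.

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

The Bellman equation reads as follows: $Q(s,a)$ is the immediate reward $r$ received after taking action $a$ in state $s$, plus the largest value reachable from the next state $s'$, $\max_{a'} Q(s',a')$, discounted by $\gamma$.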
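A small numeric example may make the Bellman equation more concrete. The helper name `bellman_target` and the numbers are assumptions chosen for illustration.

```python
import numpy as np


def bellman_target(Q, r, s_next, gamma):
    """Immediate reward plus the discounted best value of the next state."""
    return r + gamma * np.max(Q[s_next])


# Hypothetical two-state, two-action table.
Q = np.array([[0.0, 0.0],
              [0.5, 2.0]])

# Taking some action in state 0 gave reward 1.0 and led to state 1.
target = bellman_target(Q, r=1.0, s_next=1, gamma=0.9)  # 1.0 + 0.9 * 2.0
```

Here the best action in the next state is worth 2.0, so the target for the current state-action pair is 1.0 + 0.9 × 2.0 = 2.8.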

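Algorithm

Following the Bellman equation, the goal of learning is to obtain the Q-table. The algorithm is as follows:

  1. Outer loop: run num_episodes simulated episodes
  2. Inner loop: run at most num_steps steps per episode
  3. Choose an action based on the current state and the Q-table (randomness may be added)
  4. From the current state and action, obtain the next state and the reward
  5. Update the Q-table: Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])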
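The five steps above can be tried out without any external environment. The sketch below assumes a hypothetical five-state chain world (the `step` function is invented here) where moving right eventually reaches a rewarding terminal state. The noisy-greedy action choice and the update rule follow the steps listed above.

```python
import numpy as np


def step(state, action, n_states=5):
    """Hypothetical chain world: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train_q_table(num_episodes=200, num_steps=100, lr=0.8, y=0.95, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    for i in range(num_episodes):                    # 1. outer loop over episodes
        s = 0
        for _ in range(num_steps):                   # 2. inner loop over steps
            noise = rng.standard_normal(n_actions) / (i + 1)
            a = int(np.argmax(Q[s] + noise))         # 3. greedy action with decaying noise
            s1, r, done = step(s, a, n_states)       # 4. next state and reward
            Q[s, a] += lr * (r + y * np.max(Q[s1]) - Q[s, a])  # 5. update the Q-table
            s = s1
            if done:
                break
    return Q


Q = train_q_table()
greedy_policy = np.argmax(Q, axis=1)  # best action per state after training
```

Because the noise shrinks as `1/(i+1)`, early episodes explore almost at random while later ones mostly follow the learned table.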

Example

Taking FrozenLake as an example, the code is as follows:

```python
# Import libraries
# Note: this code targets the older gym API, where env.reset() returns the
# state and env.step() returns a 4-tuple (state, reward, done, info).
import gym
import numpy as np

# Load the environment
env = gym.make('FrozenLake-v0')

# Implement the Q-table learning algorithm
# Initialize the table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
# Create a list to hold the total reward per episode
# jList = []
rList = []
for i in range(num_episodes):
    # Reset the environment and get the first observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    # The Q-table learning algorithm
    while j < 99:
        j += 1
        # Choose an action greedily (with noise) from the Q-table
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
        # Get the new state and reward from the environment
        s1, r, d, _ = env.step(a)
        # Update the Q-table with the new knowledge
        Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        rAll += r
        s = s1
        if d:
            break
    # jList.append(j)
    rList.append(rAll)

print("Score over time: " + str(sum(rList) / num_episodes))
print("Final Q-Table Values")
print(Q)
```

Deep-Q-learning

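The Q-table has one problem: in real settings the number of states may be infinite, so the table would grow without bound. The solution is to replace the Q-table with a neural network that takes the state as input and outputs the Q-value of each action.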
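To illustrate the idea of "state in, one Q-value per action out", here is a minimal forward pass written with NumPy only. The layer sizes, the parameter names, and the helper functions `init_q_network` and `q_values` are assumptions made for this sketch. It shows only the mapping, not training.

```python
import numpy as np


def init_q_network(state_size, hidden_size, action_size, seed=0):
    """Randomly initialise a one-hidden-layer network (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (state_size, hidden_size)),
        "b1": np.zeros(hidden_size),
        "W2": rng.normal(0.0, 0.1, (hidden_size, action_size)),
        "b2": np.zeros(action_size),
    }


def q_values(params, states):
    """Map a batch of states (N, state_size) to Q-values (N, action_size)."""
    hidden = np.maximum(0.0, states @ params["W1"] + params["b1"])  # ReLU layer
    return hidden @ params["W2"] + params["b2"]                     # linear output


params = init_q_network(state_size=4, hidden_size=8, action_size=2)
state = np.array([[0.01, -0.02, 0.03, 0.04]])  # one continuous-valued state
qs = q_values(params, state)
action = int(np.argmax(qs[0]))                 # greedy action for this state
```

Unlike a table lookup, the same parameters produce Q-values for any state vector, including states never seen before.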

Experience replay

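Because consecutive states are correlated, reinforcement learning suffers from stability problems. The remedy is to store each experience encountered during training in a memory $M$, and to update the parameters with mini-batches sampled at random from $M$.

Specifically, each entry stored in $M$ is a tuple $\langle s, a, r, s' \rangle$. $M$ has a maximum length, which ensures that updates use recent data.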
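A bounded memory of this kind can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest entries automatically. The class name `ReplayMemory` and the tiny capacity below are illustrative choices.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (s, a, r, s_next) tuples with uniform random sampling."""

    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)  # oldest experiences fall off the front

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Sampling without replacement breaks up the temporal ordering.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(max_size=3)
for t in range(5):                  # add five experiences to a memory of size three
    memory.add(t, 0, 1.0, t + 1)
batch = memory.sample(2)
```

After five additions only the three most recent experiences remain, and each mini-batch is drawn from those.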

Exploration - Exploitation

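  • Exploration: at the start of training, some randomness is added to the choice of action so that the agent sees more of the possible situations.
  • Exploitation: as training progresses, the randomness is gradually reduced, that is, the probability of taking a random action is lowered.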
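One common way to realise this trade-off is an epsilon-greedy rule whose exploration probability decays exponentially with the training step. The sketch below uses the same decay formula as the CartPole code in this post. The function names are hypothetical.

```python
import math
import random


def explore_probability(step, start=1.0, stop=0.01, decay_rate=0.0001):
    """Exploration probability: starts near `start` and decays towards `stop`."""
    return stop + (start - stop) * math.exp(-decay_rate * step)


def choose_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


early = explore_probability(0)           # almost always random at the start
late = explore_probability(1_000_000)    # close to the floor after many steps
greedy = choose_action([0.1, 0.9], epsilon=0.0)
```

With `epsilon=0.0` the rule is purely greedy, and the floor `stop` keeps a small amount of exploration even late in training.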

Algorithm

[Figure: the deep Q-learning algorithm]
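A central step of deep Q-learning is computing the training targets for a mini-batch drawn from memory. The sketch below is one common formulation, consistent with the CartPole code in this post, where the bootstrap term is set to zero for transitions that end an episode. The function name `dqn_targets` and the sample numbers are assumptions.

```python
import numpy as np


def dqn_targets(rewards, next_q_values, terminal, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'), with no bootstrap at terminal states."""
    max_next_q = np.max(next_q_values, axis=1)         # best next-state value per row
    max_next_q = np.where(terminal, 0.0, max_next_q)   # episode ended: no future value
    return rewards + gamma * max_next_q


rewards = np.array([1.0, 1.0])
next_q = np.array([[0.5, 2.0],     # Q-values predicted for each next state
                   [3.0, 1.0]])
terminal = np.array([False, True])  # the second transition ended its episode
targets = dqn_targets(rewards, next_q, terminal, gamma=0.9)
```

The network is then trained to move its prediction for the taken action towards these targets, for example by minimising the squared difference.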

Examples

CartPole

```python
# Import libraries
# Note: this code uses the TensorFlow 1.x API (tf.placeholder, tf.contrib,
# tf.Session) and the older gym API (4-tuple return from env.step()).
from collections import deque

import gym
import numpy as np
import tensorflow as tf

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')


# Q-network
class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4,
                 action_size=2, hidden_size=10,
                 name='QNetwork'):
        # State inputs to the Q-network
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, state_size], name='inputs')

            # One-hot encode the actions to later choose the Q-value for the action
            self.actions_ = tf.placeholder(tf.int32, [None], name='actions')
            one_hot_actions = tf.one_hot(self.actions_, action_size)

            # Target Q-values for training
            self.targetQs_ = tf.placeholder(tf.float32, [None], name='target')

            # ReLU hidden layers
            self.fc1 = tf.contrib.layers.fully_connected(self.inputs_, hidden_size)
            self.fc2 = tf.contrib.layers.fully_connected(self.fc1, hidden_size)

            # Linear output layer
            self.output = tf.contrib.layers.fully_connected(self.fc2, action_size,
                                                            activation_fn=None)

            # Train with loss (targetQ - Q)^2
            # output has length 2, for two actions. The next line chooses one
            # value from output (per row) according to the one-hot encoded actions.
            self.Q = tf.reduce_sum(tf.multiply(self.output, one_hot_actions), axis=1)

            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - self.Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)


# Experience replay
class Memory():
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)),
                               size=batch_size,
                               replace=False)
        return [self.buffer[ii] for ii in idx]


# Hyperparameters
train_episodes = 1000          # max number of episodes to learn from
max_steps = 200                # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
pretrain_length = batch_size   # number of experiences to pretrain the memory

tf.reset_default_graph()
mainQN = QNetwork(name='main', hidden_size=hidden_size, learning_rate=learning_rate)

# Populate the experience memory
# Initialize the simulation
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for ii in range(pretrain_length):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        memory.add((state, action, reward, next_state))

        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

# Training
# Now train with experiences
saver = tf.train.Saver()
rewards_list = []
with tf.Session() as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())

    step = 0
    for ep in range(1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render()

            # Explore or exploit
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * step)
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from Q-network
                feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
                Qs = sess.run(mainQN.output, feed_dict=feed)
                action = np.argmax(Qs)

            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)

            total_reward += reward

            if done:
                # The episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps

                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(explore_p))
                rewards_list.append((ep, total_reward))

                # Add experience to memory
                memory.add((state, action, reward, next_state))

                # Start new episode
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())
            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1

            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])

            # Train network
            target_Qs = sess.run(mainQN.output, feed_dict={mainQN.inputs_: next_states})

            # Set target_Qs to 0 for states where the episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            target_Qs[episode_ends] = (0, 0)

            targets = rewards + gamma * np.max(target_Qs, axis=1)

            loss, _ = sess.run([mainQN.loss, mainQN.opt],
                               feed_dict={mainQN.inputs_: states,
                                          mainQN.targetQs_: targets,
                                          mainQN.actions_: actions})

    saver.save(sess, "checkpoints/cartpole.ckpt")

# Testing
test_episodes = 10
test_max_steps = 400
env.reset()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))

    for ep in range(1, test_episodes):
        t = 0
        while t < test_max_steps:
            env.render()

            # Get action from Q-network
            feed = {mainQN.inputs_: state.reshape((1, *state.shape))}
            Qs = sess.run(mainQN.output, feed_dict=feed)
            action = np.argmax(Qs)

            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)

            if done:
                t = test_max_steps
                env.reset()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())
            else:
                state = next_state
                t += 1

env.close()
```

FrozenLake

```python
# Import libraries
# Note: this code uses the TensorFlow 1.x API and the older gym API.
import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter magic; only needed inside a notebook)

# Load the environment
env = gym.make('FrozenLake-v0')

# The Q-network approach
tf.reset_default_graph()

# These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1, 16], dtype=tf.float32)
W = tf.Variable(tf.random_uniform([16, 4], 0, 0.01))
Qout = tf.matmul(inputs1, W)
predict = tf.argmax(Qout, 1)

# Below we obtain the loss by taking the sum-of-squares difference between
# the target and predicted Q-values.
nextQ = tf.placeholder(shape=[1, 4], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

# Training
init = tf.global_variables_initializer()

# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
# Create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        # Reset the environment and get the first observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        # The Q-network
        while j < 99:
            j += 1
            # Choose an action greedily (with e chance of a random action) from the Q-network
            a, allQ = sess.run([predict, Qout], feed_dict={inputs1: np.identity(16)[s:s + 1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            # Get the new state and reward from the environment
            s1, r, d, _ = env.step(a[0])
            # Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout, feed_dict={inputs1: np.identity(16)[s1:s1 + 1]})
            # Obtain maxQ' and set our target value for the chosen action
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0, a[0]] = r + y * maxQ1
            # Train our network using the target and predicted Q-values
            _, W1 = sess.run([updateModel, W],
                             feed_dict={inputs1: np.identity(16)[s:s + 1], nextQ: targetQ})
            rAll += r
            s = s1
            if d:
                # Reduce the chance of a random action as we train the model
                e = 1. / ((i / 50) + 10)
                break
        jList.append(j)
        rList.append(rAll)

# rList holds 0/1 episode rewards, so scale the mean by 100 to report a percentage
print("Percent of successful episodes: " + str(100 * sum(rList) / num_episodes) + "%")
```

References

  1. Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning with Tables and Neural Networks
  2. Udacity Deep Learning Nano Degree
