Overview
http://arxiv.org/abs/1507.04296
Introduction
Existing work on distributed deep learning has focused exclusively on supervised and unsupervised learning. In this paper we develop a new architecture for the reinforcement learning paradigm, which we call Gorila (General Reinforcement Learning Architecture), aiming to solve single-agent problems more efficiently by exploiting parallel computation.
Distributed Architecture
Four components; a minimal sketch of how they fit together follows the list.
- Actor - Generates experience by acting in its own instance of the environment.
- Replay Memory
- First, a local replay memory stores each actor’s experience locally on that actor’s machine.
- Second, a global replay memory aggregates the experience into a distributed database.
- Learner
- Generates gradients and sends them to the parameter server.
- Periodically receives the parameters of the Q-network from the parameter server.
- Parameter server - Applies these gradients to modify the parameters, using an asynchronous stochastic gradient descent algorithm.
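As a hedged illustration of this data flow, the sketch below wires the four components together in plain Python. All class and method names (`ParameterServer`, `ReplayMemory`, `Actor`, `Learner`, `apply_gradients`, and so on) are assumptions for illustration, not APIs from the paper.

```python
# Illustrative sketch of the four Gorila components; names and interfaces are assumptions.
import random
from collections import deque

import numpy as np


class ParameterServer:
    """Holds the central Q-network parameters and applies incoming gradients (async SGD)."""

    def __init__(self, num_actions, state_dim, lr=0.01):
        self.theta = np.zeros((num_actions, state_dim))
        self.lr = lr
        self.updates = 0  # number of gradient updates applied so far

    def apply_gradients(self, grad):
        self.theta -= self.lr * grad
        self.updates += 1

    def get_parameters(self):
        return self.theta.copy()


class ReplayMemory:
    """Stores (s, a, r, s_next, done) transitions; local per actor or aggregated globally."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))


class Actor:
    """Acts in its own instance of the environment and stores the experience it generates."""

    def __init__(self, memory, server):
        self.memory, self.server = memory, server

    def act(self, env_step):
        theta = self.server.get_parameters()   # sync the Q-network before acting
        self.memory.add(env_step(theta))       # env_step returns one transition


class Learner:
    """Samples experience, computes gradients, and sends them to the parameter server."""

    def __init__(self, memory, server):
        self.memory, self.server = memory, server

    def step(self, compute_gradients, batch_size=32):
        theta = self.server.get_parameters()
        batch = self.memory.sample(batch_size)
        self.server.apply_gradients(compute_gradients(theta, batch))
```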
Bundled mode. The simplest overall instantiation of Gorila, in which there is a one-to-one correspondence between actors, replay memory, and learners. The only communication between bundles is via parameters: the learners communicate their gradients to the parameter server; and the Q-networks in the actors and learners are periodically synchronized to the parameter server.
Gorila DQN. The DQN algorithm is extended to the distributed implementation in Gorila as follows. The parameter server maintains the current parameters $\theta^+$, and the actors and learners contain replicas of the current Q-network $Q(s, a; \theta)$ that are synchronized from the parameter server before every acting step. The learner additionally maintains the target Q-network $Q(s, a; \theta^-)$, which is updated from the parameter server's $\theta^+$ after every $N$ gradient updates in the central parameter server.
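As a concrete illustration (not the paper's code), here is a sketch of one learner with a linear Q-function $Q(s, a; \theta) = \theta_a^\top s$, reusing the hypothetical `ParameterServer` and `ReplayMemory` interfaces sketched above; the linear parameterisation and all helper names are assumptions.

```python
# Sketch of a Gorila DQN learner with a linear Q(s, a; theta) = theta[a] @ s.
# The linear Q-function and all names here are illustrative assumptions.
import numpy as np


def dqn_gradient(theta, theta_target, batch, gamma=0.99):
    """DQN gradient for a minibatch of (s, a, r, s_next, done) transitions."""
    grad = np.zeros_like(theta)
    for s, a, r, s_next, done in batch:
        # TD target y uses the *target* network parameters theta_target (theta-).
        y = r if done else r + gamma * np.max(theta_target @ s_next)
        # Gradient of 0.5 * (y - Q(s, a; theta))^2 with respect to theta[a].
        grad[a] -= (y - theta[a] @ s) * s
    return grad / len(batch)


def learner_loop(server, memory, target_sync_period=60_000, batch_size=32):
    """Sync theta from the server, push gradients, refresh theta- every N central updates."""
    theta_target = server.get_parameters()
    while True:
        theta = server.get_parameters()                 # sync the online Q-network
        grad = dqn_gradient(theta, theta_target, memory.sample(batch_size))
        server.apply_gradients(grad)
        if server.updates % target_sync_period == 0:    # N updates counted at the server
            theta_target = server.get_parameters()
```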
Stability. While the DQN training algorithm was designed to ensure stability of training neural networks with reinforcement learning, training using a large cluster of machines running multiple other tasks poses additional challenges. The Gorila DQN implementation uses additional safeguards to ensure stability in the presence of disappearing nodes, slowdowns in network traffic, and slowdowns of individual machines. One such safeguard is a parameter that determines the maximum time delay between the local parameters and the parameters in the parameter server. All gradients older than the threshold are discarded by the parameter server. Additionally, each actor/learner keeps a running average and standard deviation of the absolute DQN loss for the data it sees and discards gradients with absolute loss higher than the mean plus several standard deviations. Finally, we used the AdaGrad update rule.
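A rough sketch of the two gradient-rejection safeguards is shown below; the class names, the Welford-style running statistics, and the exact accept/reject logic are assumptions, since the paper only describes these safeguards in prose.

```python
# Sketch of the two gradient-rejection safeguards; all names and details are illustrative.
import math


class StaleGradientFilter:
    """Parameter-server side: drop gradients computed from parameters that are too old."""

    def __init__(self, max_delay):
        self.max_delay = max_delay

    def accept(self, current_version, gradient_version):
        return current_version - gradient_version <= self.max_delay


class LossOutlierFilter:
    """Learner side: drop gradients whose absolute loss is far above the running mean."""

    def __init__(self, num_stddevs=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.num_stddevs = num_stddevs

    def accept(self, abs_loss):
        # Welford's online update of the mean and variance of the absolute loss.
        self.n += 1
        delta = abs_loss - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (abs_loss - self.mean)
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return self.n < 2 or abs_loss <= self.mean + self.num_stddevs * std
```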
Experiments
We evaluated Gorila by conducting experiments on 49 Atari 2600 games using the Arcade Learning Environment.
In all experiments, Gorila DQN used $N_{param} = 31$ and $N_{learn} = N_{act} = 100$, running in bundled mode. The replay memory size was $D = 1$ million frames, and the behaviour policy was $\epsilon$-greedy with $\epsilon$ annealed from 1 to 0.1 over the first one million global updates. Each learner syncs the parameters $\theta^-$ of its target network after every 60K parameter updates performed in the parameter server.
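For concreteness, the $\epsilon$ schedule described above (annealed from 1 to 0.1 over the first one million global updates) might be implemented as a simple linear interpolation; the linear form is an assumption, while the endpoints come from the text.

```python
def epsilon(global_updates, eps_start=1.0, eps_end=0.1, anneal_updates=1_000_000):
    """Anneal epsilon linearly from eps_start to eps_end over the first anneal_updates updates."""
    frac = min(global_updates / anneal_updates, 1.0)
    return eps_start + frac * (eps_end - eps_start)


# e.g. epsilon(0) == 1.0, epsilon(500_000) == 0.55, epsilon(2_000_000) == 0.1
```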