 ISSN 1751-956X

Authors: Seyed Sajad Mousavi, Michael Schukat, Enda Howley






Recent advances in combining deep neural network architectures with reinforcement learning (RL) techniques have shown promising results in solving complex control problems with high-dimensional state and action spaces.


Inspired by these successes, in this study, the authors built two kinds of RL algorithms: deep policy-gradient (PG) and value-function-based agents which can predict the best possible traffic signal for a traffic intersection.


At each time step, these adaptive traffic light control agents receive a snapshot of the current state of a graphical traffic simulator and produce control signals.


The PG-based agent maps its observation directly to the control signal; 


however, the value-function-based agent first estimates values for all legal control signals. 


The agent then selects the optimal control action with the highest value.


Their methods show promising results in a traffic network simulated in the Simulation of Urban MObility (SUMO) traffic simulator, without suffering from instability issues during the training process.


1 Introduction

Given the fast-growing population around the world, the urban population in the 21st century is expected to increase dramatically.

Hence, it is imperative that urban infrastructure is managed effectively to contend with this growth.

One of the most critical considerations when designing modern cities is developing smart traffic management systems.

The main goal of a traffic management system is to reduce traffic congestion, which is nowadays one of the major issues of megacities.

Efficient urban traffic management results in time and financial savings as well as reduced carbon dioxide emissions into the atmosphere.


To address this issue, a lot of solutions have been proposed [1–4].


They can be roughly classified into three groups.


The first is pre-timed signal control, where a fixed time is determined for all green phases according to historical traffic demand, without considering possible fluctuations in traffic demand. 


The second is vehicle-actuated signal control, where traffic demand information provided by inductive loop detectors at an equipped intersection is used to control the signals, e.g. by extending or terminating a green phase.


The third is adaptive signal control, where the signal timing control is managed and updated automatically according to the current state of the intersection (i.e. traffic demand, queue length of vehicles in each lane of the intersection and traffic flow fluctuation) [5]. 


In this paper, we are interested in the third approach and aim to propose two novel methods for traffic signal control by leveraging recent advances in machine learning and artificial intelligence fields [6, 7].


Reinforcement learning (RL) [8], as a machine learning technique for the traffic signal control problem, has led to impressive results [2, 9] and has shown promise as a solver.

An RL agent does not need perfect knowledge of the environment (for example, the traffic flow) in advance.

Instead, it is able to gain knowledge and model the dynamics of the environment just by interacting with it.


An RL agent learns based on trial and error. It receives a scalar reward after taking each action in the environment. 


The obtained reward reflects how good the taken action is, and the agent's goal is to learn an optimal control policy, so that the discounted cumulative reward is maximised via repeated interaction with its environment.


Aside from traffic control, RL has been applied to a number of real-world problems such as cloud computing [10, 11].


Typically, the complexity of using RL in real-world applications such as traffic signal management grows exponentially as the state and action spaces increase.

To deal with this problem, function approximation techniques and hierarchical RL (HRL) approaches can be used.

Recently, deep learning has attracted huge attention and has been successfully combined with RL techniques to deal with complex optimisation problems such as playing Atari 2600 games [7] and computer Go [12], where classical RL methods could not provide optimal solutions.

In this way, the current state of the environment is fed into a deep neural net [e.g. a convolutional neural network (CNN) [13]] trained by RL techniques to predict the next possible optimal action(s).

Inspired by the successes of combining RL with deep learning paradigm and with regard to the complex nature of environment of traffic signal control problem, in this paper we aim to use the effectiveness and power of deep RL to build adaptive signal control methods in order to optimise the traffic flow.


Although a few previous studies have tried to apply deep RL to the traffic signal control problem [14, 15], in this research the state representation is different.

Also, one of our methods uses a policy-gradient (PG) method, which does not suffer from oscillations and instabilities during the training process and can take full advantage of the available data of the environment to develop the optimal control policy.

We propose adaptive signal controllers that combine two RL approaches (i.e. PG and action-value function) with a deep convolutional neural network, and that perceive embedded camera observations in order to produce control signals at an isolated intersection.

We conduct simulated experiments with our proposed methods in the Simulation of Urban MObility (SUMO) traffic simulator.


The rest of this paper is organised as follows.


Section 2 provides related work in the area of traffic light control (TLC).


Section 3 gives a brief review of RL techniques which we have used in this research. 


Section 4 presents how to formulate the TLC problem as an RL task and the proposed methods to solve the task.


Then, Section 5 provides simulation results and the performance of the proposed approaches.


Finally, Section 6 concludes this paper and gives some directions for future research.


2 Related work

A lot of research has been done in academic and industry communities to build adaptive traffic signal control systems.


In particular, significant research has been conducted employing RL methods in the area of traffic light signal control [16–20].


These works have achieved promising results. 


However, their simulation testbeds have not been mature enough to be comparable with more realistic situations. 


The development of advanced traffic simulation tools has enabled researchers to develop novel state representations and reward functions for RL algorithms, which can consider more aspects of the complexity and reality of real-world traffic problems [3, 5, 21–24].

All these attempts viewed the TLC problem as a fully observable Markov decision process (MDP) and investigated whether the Q-learning algorithm can be applied to it.

However, Richter's study formulated the traffic problem as a partially observable MDP (POMDP) and applied PG methods to guarantee local convergence in a partially observable environment [25].


With advances in deep learning and its application to different domains [11, 26, 27], deep learning has gained attention in the area of traffic management systems.

Previous research has used deep stacked auto-encoder (SAE) neural networks to estimate Q values, where each Q value corresponds to an available signal phase [28].

It used measures of speed and queue length as its state at each time step of the learning process.

Two recent studies by van der Pol and Oliehoek [14] and Genders and Razavi [15] provided deep RL agents that used a deep Q-network [7] to map from given states to Q values.

Their state representations were a binary matrix of the positions of vehicles on the lanes of an intersection, and a combination of the presence matrix of vehicles, speed and the current traffic signal phase, respectively.

However, we use raw visual input data of the traffic simulator snapshots as system states.

Moreover, in addition to estimating the Q-function, one of the proposed methods directly maps from the input state to a probability distribution over actions (i.e. signal phases) via a deep PG method.


3 Background

In this section, we review RL approaches and briefly describe how RL is applied to real-world problems where the numbers of states and actions are so large that regular RL techniques cannot deal with them.


3.1 Reinforcement learning

A common RL [8] setting is shown in Fig. 1, where an RL agent interacts with an environment.


The interaction is continued until reaching a terminal state or the agent meets a termination condition.


Usually, the problems that RL techniques are applied to are treated as MDPs.


An MDP is defined as a five-tuple $\langle S, A, T, R, \gamma \rangle$, where $S$ is the set of states in the state space of the environment, $A$ is the set of actions in the action space that the agent can use to interact with the environment, $T$ is the transition function, which gives the probability of moving between the environment states, $R$ is the reward function and $\gamma \in [0, 1]$ is the discount factor, which models the relative importance of future and immediate rewards.

At each time step $t$, the agent perceives the state $s_t \in S$ and, based on its observation, selects an action $a_t$.

Taking the action causes the environment to transition to the next state $s_{t+1} \in S$ according to the transition function $T$.

Then, the agent receives a reward $r_t$, which is determined by the reward function $R$.

The goal of the learning agent in the RL framework is to learn an optimal policy $\pi: S \times A \to [0, 1]$, which defines the probability of selecting action $a_t$ in state $s_t$, so that by following this policy the expected cumulative discounted reward over time is maximised.


The discounted future reward, $R_t$, at time $t$ is defined as follows:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \qquad (1)$$

where the role of the discount factor $\gamma$ is to trade off the worth of immediate and future rewards.
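As a small illustration of the discounted future reward, the sketch below computes it for a finite list of rewards. The function name `discounted_return` and the example values are assumptions made for illustration only; a finite episode is assumed in place of the infinite horizon.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Work backwards so that each step applies R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards: 1 + 0.5 * 1 + 0.25 * 1
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a small $\gamma$ the later rewards contribute little, whereas $\gamma$ close to 1 weights future rewards almost as much as immediate ones.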


In most real-world problems, there are many states and actions, which makes it impossible to apply classic RL techniques that use tabular representations of their state and action spaces.


Hence, it is common to use function approximators [29] or decomposition and aggregation techniques such as HRL approaches [30–32] and advanced HRL [33].


Different forms of function approximators can be used with RL techniques.


Examples are linear function approximation, i.e. a linear combination of features $f$ of the state and action spaces and learned weights $w$ (e.g. $\sum_i f_i w_i$), and non-linear function approximation (e.g. a neural network).
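The linear case can be sketched in a few lines. The function name `linear_value` and the feature values are hypothetical; the sketch only shows the weighted sum $\sum_i f_i w_i$ and not how the weights are learned.

```python
from typing import Sequence


def linear_value(features: Sequence[float], weights: Sequence[float]) -> float:
    """Approximate a value as the dot product of features and learned weights."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights))


# Hypothetical two-feature example: 1.0 * 0.5 + 2.0 * (-1.0)
value = linear_value([1.0, 2.0], [0.5, -1.0])
```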


Until recently, the majority of work in RL applied linear function approximators.


More recently, deep neural networks (DNNs) such as CNNs, recurrent neural networks and SAEs have also been commonly used as function approximators for large RL tasks [6, 34].


Interested readers are referred to [35] for a review of using DNNs within the RL framework.


3.2 Deep learning and deep Q-learning

Deep learning techniques are one of the best solutions to address high-dimensional data and extract discriminative information from the data.


Deep learning algorithms have the capability of automating feature extraction (the extraction of representations) from the data.


The representation is learned through the data which are fed directly into deep nets without using human knowledge (i.e. automated feature extraction).


Deep learning models contain multiple layers of representations. Indeed, such a model is a stack of building blocks such as auto-encoders, restricted Boltzmann machines and convolutional layers.

During training, the raw data is fed into a network consisting of multiple layers.

The output of each layer, which is a non-linear feature transformation, is used as input to the next layer of the DNN.

The output representation of the final layer can be used for constructing classifiers or for applications that achieve better efficiency and performance when given abstract, hierarchical representations of the data as inputs.


A non-linear transformation is applied at each layer on its input to try to learn and extract underlying explanatory factors.


Consequently, this process learns a hierarchy of abstract representations.


One of the main advantages of DNNs is the capability of automating feature extraction from raw input data.


A deep Q-learning network (DQN) [6] uses this benefit of deep learning in order to represent the agent's observation as an abstract representation in learning an optimal control policy. 


The DQN method combines a DNN function approximator with Q-learning to learn the action-value function and, as a result, a policy $\pi$, i.e. the behaviour of the agent, which tells the agent what action should be selected for each input state.

Applying non-linear function approximators such as neural networks with model-free RL algorithms in high-dimensional continuous state and action spaces leads to convergence problems [36].

The reasons for these issues are: (i) consecutive states in RL tasks are correlated and (ii) the underlying policy of the agent changes frequently because of slight changes in the Q values.

To cope with these problems, the DQN provides some solutions which improve the performance of the algorithm significantly.

For the problem of correlated states, DQN uses the previously proposed experience replay approach [37].


In this way, at each time step, the DQN stores the agent's experience $(s_t, a_t, r_t, s_{t+1})$ in a data set $D$, where $s_t$, $a_t$ and $r_t$ are the state, chosen action and received reward, respectively, and $s_{t+1}$ is the state at the next time step.

To update the network, the DQN uses stochastic mini-batch updates with uniformly random sampling from the experience replay memory (previously observed transitions) at training time.

This breaks the strong correlations between consecutive samples.
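A minimal sketch of such an experience replay memory is given below, assuming a fixed capacity with the oldest transitions discarded first. The class name `ReplayMemory`, the capacity and the seed are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import deque
from typing import Any, Deque, List, Tuple

# One stored experience: (state, action, reward, next_state)
Transition = Tuple[Any, int, float, Any]


class ReplayMemory:
    """Fixed-size store of past transitions with uniform random sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # A deque with maxlen drops the oldest transition once it is full.
        self._buffer: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state: Any, action: int, reward: float, next_state: Any) -> None:
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement; never ask for more than is stored.
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)
```

Because each mini-batch mixes transitions from many different time steps, consecutive samples in a batch are no longer neighbouring states of the same trajectory.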


Another approach to dealing with the aforementioned convergence issues, which we also examine in this research, is to use PG methods.


This approach has demonstrated better convergence properties in some RL problems [38].


3.3 PG methods

A PG method tries to optimise a parameterised policy function by a gradient-descent method.

Indeed, PG methods search the policy space to learn policies directly, instead of estimating state-value or action-value functions.

Unlike traditional RL algorithms, PG methods do not suffer from the convergence problems of estimating value functions under non-linear function approximation or in environments which might be POMDPs.

They can also deal with the complexity of continuous state and action spaces better than purely value-based methods [38].

PG methods compute Monte Carlo estimates of the policy gradient [39].

These methods are guaranteed to converge to a local optimum of their parameterised policy function.

However, PG methods typically result in high variance in their gradient estimates.

Hence, in order to reduce the variance of the gradient estimators, some methods subtract a baseline function from the policy gradients.

The baseline function can be calculated in different manners [40, 41]. Inspired by these features of PG methods and the success of neural networks in automatic feature abstraction, we use DNNs to represent an optimal traffic control policy directly in the traffic signal control problem.


4 System description

In this section, we formulate the TLC problem as an RL task by describing the states, actions and reward function.


We then present the policy as a DNN and how to train the network.


4.1 State representation

We represent the state of the system as an image $s_t \in \mathbb{R}^d$, i.e. a snapshot of the current state of a graphical simulator {e.g. SUMO graphical user interface (GUI) [42]}, which is a vector of raw pixel values of the current view of the intersection at each step of the simulation (as shown in Fig. 1).

This kind of representation is like placing a camera at an intersection so that it can view the whole intersection.

The state representation in the TLC literature is usually a vector representing the presence of vehicles at the intersection, i.e. a Boolean-valued vector where a value of 1 indicates the presence of a vehicle and a value of 0 indicates its absence [14, 43], or a combination of the presence vector with another vector indicating the vehicles' speed at the given intersection [15].

These state representations rely on prior knowledge and make assumptions that do not generalise to the real world.


However, by feeding the state as an image to a CNN, the system can detect the location and the presence of all vehicles with different lengths and as a result the vehicles’ queue on each lane.


Furthermore, by stacking a history of consecutive observations as input, the convolutional layers of a deep network are able to estimate velocity and travel direction of vehicles. 


Hence, the system can implicitly benefit from this information as well.


4.2 Action set

To control traffic signal phases, we define a set of possible actions A = {North/South green (NSG), East/West green (EWG)}.

NSG allows vehicles to pass from North to South and vice versa and also indicates that the vehicles on the East/West route should stop and not proceed through the intersection.

EWG allows vehicles to pass from East to West and vice versa and implies that the vehicles on the North/South route should stop and not proceed through the intersection.

At each time step $t$, the agent chooses an action $a_t \in A$ according to its strategy.


Depending on the selected action, the vehicles on each lane are allowed to cross the intersection.


4.3 Reward function

Typically, an immediate reward $r_t \in \mathbb{R}$ is a scalar value which the agent receives after taking the chosen action in the environment at each time step.

We set the reward as the difference between the total cumulative delays of two consecutive actions, i.e.

$$r_t = D_{t-1} - D_t \qquad (2)$$

where $D_t$ and $D_{t-1}$ are the total cumulative delays at the current and previous time steps.

The total cumulative delay at time $t$ is the sum of the cumulative delays of all the vehicles that have appeared in the system from $t = 0$ to the current time step $t$.

Positive reward values imply that the taken actions led to a decrease in the total cumulative delay, and negative rewards imply an increase in the delay.


With regard to the reward values, the agent may decide to change its policy in certain states of the system in the future.
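The reward signal described in this subsection can be sketched as follows. The function names and the delay values are hypothetical, and how per-vehicle delays are read from the simulator is deliberately left out.

```python
from typing import Sequence


def total_cumulative_delay(vehicle_delays: Sequence[float]) -> float:
    """Sum the cumulative delay of every vehicle seen so far (assumed given)."""
    return float(sum(vehicle_delays))


def delay_reward(prev_total_delay: float, curr_total_delay: float) -> float:
    """Reward is the previous total delay minus the current total delay."""
    return prev_total_delay - curr_total_delay


# Hypothetical totals: the delay dropped from 120 to 100, so the reward is positive.
r_good = delay_reward(120.0, 100.0)
# The delay grew from 100 to 130, so the reward is negative.
r_bad = delay_reward(100.0, 130.0)
```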


4.4 Agent's policy

The agent chooses the actions based on a policy π.


In the policy-based algorithm, the policy is defined as a mapping from the input state to a probability distribution over the actions A.

We use the DNN as the function approximator and refer to its parameters $\theta$ as the policy parameters.

The policy distribution $\pi(a_t \mid s_t; \theta)$ is learned by performing gradient descent on the policy parameters.

In the value-function-based algorithm, the action-value function maps the input state to action values, each of which represents the future reward that can be achieved for the given state and action.

The optimal policy can then be extracted by greedily selecting the best possible action.
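The two ways of choosing an action can be contrasted in a short sketch. It assumes the network outputs are already available as plain lists (logits for the policy-based agent, action values for the value-function-based agent); the function names and the numbers are illustrative only.

```python
import math
import random
from typing import List, Sequence

ACTIONS = ("NSG", "EWG")  # the two signal phases of the action set


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw network outputs into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(logits: Sequence[float], rng: random.Random) -> str:
    """Policy-based agent: sample a phase from the policy distribution."""
    return rng.choices(ACTIONS, weights=softmax(logits), k=1)[0]


def greedy_action(q_values: Sequence[float]) -> str:
    """Value-function-based agent: pick the phase with the highest value."""
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return ACTIONS[best]


rng = random.Random(0)
phase_from_policy = sample_action([0.2, 1.3], rng)
phase_from_values = greedy_action([1.0, 3.0])
```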


4.5 Objective function and system training

There are many measures in the traffic signal management literature, such as maximising throughput, minimising and balancing queue length and minimising the delay, that can be considered as the learning agent's objective function.

In this research, the agent aims to maximise the reduction in the total cumulative delay, which has been shown empirically to maximise throughput and to reduce queue length (more details are discussed in Section 5.3).

The objective of the agent is to maximise the expected cumulative discounted reward.

We aim to maximise the reward under the probability distribution $\pi(a_t \mid s_t; \theta)$:

$$J(\theta) = \mathbb{E}_{\pi(a_t \mid s_t; \theta)}\left[ R_t \right] \qquad (3)$$


We divide the system training based on two RL approaches: value-function-based and policy-based. 


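In the value-function-based approach, the value function $Q^{\pi}(s, a)$ is defined as follows:

$$Q^{\pi}(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^{\pi}(s', a') \mid s_t = s, a_t = a \right] \qquad (4)$$

where it is implicit that $s, s' \in S$ and $a \in A$.

The value function can be parameterised as $Q(s, a; \theta)$ with parameter vector $\theta$.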


Typically, gradient-descent methods are used to learn the parameters $\theta$ by minimising the following mean-squared-error loss on the Q values:

$$L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^2 \right] \qquad (5)$$

where $r + \gamma \max_{a'} Q(s', a'; \theta)$ is the target value. In the DQN algorithm, a target Q-network is used to address the instability problem of the policy.

The network is trained with the target Q-network to obtain consistent Q-learning targets: the weight parameters $\theta^-$ used in the Q-learning target are kept fixed and are updated periodically, every $N$ steps, from the parameters $\theta$ of the main network.

The target value of the DQN is represented as follows:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-) \qquad (6)$$

where $\theta^-$ are the parameters of the target network. The stochastic gradient-descent method is used to optimise (5).

The parameters of the deep Q-learning algorithm are updated as follows:

$$\theta_{i+1} = \theta_i + \alpha \left( y_i - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \qquad (7)$$

where $y_i$ is the target value for iteration $i$ and $\alpha$ is a scalar learning rate.

Algorithm 1 (see Fig. 2) presents the pseudo-code for the training algorithm.
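The target computation and parameter update can be illustrated with a deliberately simplified sketch. It assumes a linear Q-function over a hypothetical feature vector in place of the CNN used in the paper, so that the gradient of Q with respect to the parameters is simply the feature vector; all names and numbers are illustrative.

```python
from typing import List, Sequence


def q_value(theta: Sequence[float], features: Sequence[float]) -> float:
    """Linear stand-in for Q(s, a; theta); the paper uses a CNN instead."""
    return sum(w * f for w, f in zip(theta, features))


def td_target(reward: float, next_q_target: Sequence[float], gamma: float,
              terminal: bool = False) -> float:
    """Target y = r + gamma * max_a' Q(s', a'; theta_minus)."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)


def sgd_step(theta: Sequence[float], features: Sequence[float],
             target: float, alpha: float) -> List[float]:
    """One update theta <- theta + alpha * (y - Q) * grad Q (grad = features here)."""
    error = target - q_value(theta, features)
    return [w + alpha * error * f for w, f in zip(theta, features)]


def sync_target(theta: Sequence[float]) -> List[float]:
    """Copy the main parameters into the target network (done every N steps)."""
    return list(theta)


theta = [0.0, 0.0]
theta_minus = sync_target(theta)
# Hypothetical next-state values as evaluated by the target network.
y = td_target(reward=1.0, next_q_target=[2.0, 3.0], gamma=0.5)
theta = sgd_step(theta, features=[1.0, 2.0], target=y, alpha=0.1)
```

Keeping `theta_minus` fixed between synchronisations means the targets do not shift with every gradient step, which is the stabilising effect attributed to the target network.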

In the policy-based approach, the gradient of the objective function represented in (3) is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, R_t \right] \qquad (8)$$

Equation (8) is the standard learning rule of the REINFORCE algorithm [44].

It updates the policy parameters $\theta$ in the direction $\nabla_\theta \log \pi(a_t \mid s_t; \theta)$, so that the probability of action $a_t$ in state $s_t$ is increased if it has led to a high cumulative reward and decreased if it has resulted in a low reward.

The gradient estimate in (8) has high variance. It is common to reduce the variance by subtracting a baseline function $b_t(s_t)$ from the return $R_t$, without changing the expectation.

Commonly, an estimate of the state-value function is used as the baseline, $b_t(s_t) = V^{\pi}(s_t; \theta_v)$.

Thus, the adjusted gradient is $\nabla_\theta \log \pi(a_t \mid s_t; \theta)(R_t - b_t(s_t))$.

The value $R_t - b_t$ is known as the advantage function.

In the advantage actor–critic method [45], a single update is computed by selecting actions using the underlying policy for up to M steps or until a terminal state is reached.

In this way, the agent obtains up to M rewards from the environment at each update point and updates the policy parameters after every $n \le M$ steps using n-step returns.

The parameter vector $\theta$ is updated through the stochastic gradient-descent method:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, A(s_t, a_t; \theta, \theta_v) \qquad (9)$$

where $A(s_t, a_t; \theta, \theta_v)$ is an estimate of the advantage function given by $\sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n}; \theta_v) - V(s_t; \theta_v)$, where $n$ may vary from state to state, up to M.

This process is an actor–critic algorithm: the policy $\pi(a_t \mid s_t; \theta)$ is the actor and the estimate of the state-value function $V^{\pi}(s_t; \theta_v)$ is the critic [45, 46].

Algorithm 2 (see Fig. 3) shows the pseudo-code for the training algorithm.
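The n-step advantage estimate can be sketched for one short rollout as follows. The sketch assumes that the critic's value estimates are already available as a list and that the value of the state after the last step is passed in as `bootstrap_value` (zero if that state is terminal); names and numbers are hypothetical.

```python
from typing import List, Sequence


def n_step_advantages(rewards: Sequence[float], values: Sequence[float],
                      bootstrap_value: float, gamma: float) -> List[float]:
    """For each step t of a rollout, compute
    sum_i gamma**i * r_{t+i} + gamma**n * V(s_{t+n}) - V(s_t),
    where n is the number of steps remaining in the rollout."""
    advantages: List[float] = []
    ret = bootstrap_value
    # Walk the rollout backwards, accumulating the bootstrapped n-step return.
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret
        advantages.append(ret - v)
    advantages.reverse()
    return advantages


# Hypothetical two-step rollout with critic estimates of 0.5 for both states.
adv = n_step_advantages(rewards=[1.0, 1.0], values=[0.5, 0.5],
                        bootstrap_value=2.0, gamma=0.5)
```

Earlier steps in the rollout use longer returns than later ones, which matches the remark that $n$ may differ from state to state, up to M.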

5 Experiment and results

In this section, we present the simulation environment in which our experiments were carried out. We then describe the details of the DNN used to represent the agent's policy, including its hyper-parameters.

5.1 Experiment setup

We used the SUMO [42] tool to simulate traffic in all experiments.

SUMO is a well-known open-source traffic simulator which provides useful application programming interfaces and a GUI view to model large road networks, as well as some possibilities to handle them.

In particular, we used SUMO–GUI v0.28.0, as it allows snapshots to be taken of each step of the simulation.

The intersection geometry used in this paper is shown in Fig. 4.


There are four incoming lanes to the intersection and four outgoing lanes from the intersection. 


To generate traffic demand from the different directions (i.e. North-to-South, West-to-East and vice versa) on the road network, we randomly sample from a uniform probability distribution, generating vehicles with a probability of 0.1 at each of the 3600 time steps.
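One possible reading of this demand model is sketched below: at every time step and for every direction, a uniform random number is drawn and a vehicle departs if it falls below 0.1. This interpretation, the route names and the seed are assumptions made for illustration; the sketch does not write a SUMO route file.

```python
import random
from typing import List, Tuple

# Hypothetical route identifiers for the four travel directions.
ROUTES = ("north_to_south", "south_to_north", "west_to_east", "east_to_west")


def generate_departures(num_steps: int = 3600, prob: float = 0.1,
                        seed: int = 0) -> List[Tuple[int, str]]:
    """Return (time_step, route) pairs for every generated vehicle."""
    rng = random.Random(seed)
    departures: List[Tuple[int, str]] = []
    for step in range(num_steps):
        for route in ROUTES:
            # rng.random() is uniform on [0, 1); a vehicle departs with probability prob.
            if rng.random() < prob:
                departures.append((step, route))
    return departures


departures = generate_departures()
```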


5.2 System architecture and hyper-parameters

We took the snapshots from the SUMO–GUI and did some basic pre-processing.

The snapshots were converted from red–green–blue (RGB) representation to grey-scale and resized to 128 × 128 frames.

To enable our system to memorise a history of the past observations, we stacked the last four frames of the history and provided them to the system as input.


So, the input to the network was a 128 × 128 × 4 image.
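The pre-processing and frame stacking can be sketched with plain Python lists. The luma weights used for the grey-scale conversion (ITU-R BT.601) and the choice to fill the stack with the first frame are assumptions, since the paper does not state them; resizing to 128 × 128 is omitted here.

```python
from collections import deque
from typing import Any, Deque, List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def to_grey(rgb_frame: Sequence[Sequence[Pixel]]) -> List[List[float]]:
    """Convert rows of (r, g, b) pixels to grey-scale (assumed BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_frame]


class FrameStack:
    """Keep the most recent `size` frames as the agent's input state."""

    def __init__(self, size: int = 4) -> None:
        self._size = size
        self._frames: Deque[Any] = deque(maxlen=size)

    def push(self, frame: Any) -> List[Any]:
        if not self._frames:
            # Assumption: at the start of an episode, repeat the first frame.
            for _ in range(self._size):
                self._frames.append(frame)
        else:
            self._frames.append(frame)  # the oldest frame is dropped automatically
        return list(self._frames)


stack = FrameStack(size=4)
state = stack.push(to_grey([[(255, 255, 255), (0, 0, 0)]]))
```

Stacking four consecutive frames is what lets the convolutional layers infer motion, since a single snapshot carries no velocity information.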


We applied approximately the same architecture as the deep Q-network (DQN) algorithm introduced by Mnih et al. [6, 7].

The network consists of a stack of two convolutional layers with 16 filters of size 8 × 8 and 32 filters of size 4 × 4, with strides 4 and 2, respectively.

The final hidden layer is fully connected with 256 hidden nodes. 


All three hidden layers are followed by a rectifier non-linearity. 
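To make the layer sizes concrete, the following sketch works out the feature-map dimensions for the 128 × 128 × 4 input. It assumes convolutions without padding, which the paper does not state explicitly, so the resulting numbers should be read as indicative.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output width/height of a convolution without padding (an assumption)."""
    return (size - kernel) // stride + 1


side_1 = conv_out(128, kernel=8, stride=4)    # after the 16-filter 8x8 layer
side_2 = conv_out(side_1, kernel=4, stride=2)  # after the 32-filter 4x4 layer
flat_features = side_2 * side_2 * 32           # inputs to the 256-unit layer
fc_weights = flat_features * 256 + 256         # weights plus biases of that layer
```

Under this assumption the first layer produces 31 × 31 × 16 feature maps and the second 14 × 14 × 32, so the fully connected layer receives 6272 inputs.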


The main difference from the network architecture of the DQN method is the last layer. The last layer of DQN is a fully connected linear layer with one output neuron [i.e. Q value $Q(a, s)$] for each action in a given Atari 2600 game, while in the policy-based model the last layer represents two sets of outputs: a softmax output resulting in a probability distribution over the actions A [i.e. the policy $\pi(a, s)$] and a single linear output node resulting in the estimate of the state-value function $V(s)$.

For the value-function model, we used the same architecture as the DQN.

Its output layer corresponds to the action values.


In all of our experiments, the discount factor was set to $\gamma = 0.99$ and all weights of the network were updated by the Adam optimiser [47] with a learning rate $\alpha = 0.00001$ and with mini-batches of size M (up to 32), where M is the maximum number of steps that the agent can take while following its policy before the policy needs to be updated.

The network was trained for about 1050 epochs (∼2 million time steps).

Each epoch corresponds to ten episodes and each episode was a complete SUMO–GUI simulation.

The policies learned by the agent were evaluated every ten episodes by running SUMO–GUI for five episodes and averaging the resulting rewards, total cumulative delay and queue length.

To evaluate our proposed method, we also built a shallow neural network (SNN) with one hidden layer.


The hidden layer has 64 hidden nodes followed by a rectifier non-linearity. 


The output layer is a fully connected linear layer with a number of output neurons corresponding to each traffic signal phase in the intersection.


Two vectors are used as the input state of the network.

The first represents the number of queued vehicles in the lanes of the intersection (i.e. North, South, East and West) and the second corresponds to the current traffic signal phase of the intersection.

The SNN is trained with the same hyper-parameters and optimisation method (i.e. the gradient-descent algorithm) as the proposed methods.


5.3 Results and discussion

To evaluate the performance of the proposed methods, we compared them against a baseline traffic controller, a controller that gives an equal fixed time to each phase of the intersection. 
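A fixed-time baseline of this kind can be sketched in a few lines. The phase duration of 30 steps is a hypothetical value, as the paper does not report the one it used.

```python
PHASES = ("NSG", "EWG")  # the two signal phases, served in turn


def fixed_time_phase(step: int, phase_duration: int) -> str:
    """Return the active phase when every phase gets the same fixed duration."""
    return PHASES[(step // phase_duration) % len(PHASES)]


# Hypothetical duration of 30 steps per phase.
schedule = [fixed_time_phase(t, phase_duration=30) for t in (0, 29, 30, 59, 60)]
```

Unlike the learned agents, this controller ignores the state of the intersection entirely, which is what makes it a useful reference point.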


We ran the SUMO–GUI simulator for the proposed model using the configuration explained in Section 5.2 and compared the average reward, average total cumulative delay and average queue length achieved with those of the baseline.

Fig. 5 shows the received average reward while the agent follows a certain policy. As shown in Fig. 5, the proposed method performs significantly better than the baseline and obtains higher rewards as training proceeds over more epochs.

This gradually increasing reward reflects the agent's ability to learn an optimal control policy in a stable manner.

Unlike the use of deep RL for estimating the Q values in the traffic light optimisation problem [14], the proposed agent does not suffer from stability issues.

To assess the policy learned by the agent, two of the most common performance metrics in the traffic signal control literature are used: the cumulative delay and the queue length.

Figs. 6 and 7 compare the learning agent with the baseline in terms of average cumulative delay time and average queue length, respectively, while the agent follows the learned policy over time.

The plots clearly show that the agent is able to find a policy that minimises the queue length and the total cumulative delay.


Moreover, these graphs reveal that by using the reward function for reducing cumulative delay, the intersection queue length is reduced as well as the total cumulative delay of all vehicles.


We also compared the proposed methods with the SNN, which is a shallow neural network with one hidden layer.


Table 1 reports a comparison of the proposed models and the SNN model in terms of the average and standard deviation (μ, σ) of the average queue length, average cumulative delay time and received average reward.

The results in Table 1 are calculated from the last 100 training epochs of each method.

Comparing the metrics shown in Table 1 demonstrates that the proposed models significantly outperform the SNN method.

On the basis of the data in Table 1, compared with the SNN we observe reductions of 67% and 72% in the average cumulative delay and queue length, respectively, for the PG method, and of 68% and 73% for the value-function-based method.

Furthermore, we can see that the proposed methods received average rewards superior to those of the SNN.

Considering these results, it is clear that the policy-gradient and value-function agents learned better control policies than the SNN approach.

6 Conclusion

In this paper, we applied deep RL algorithms, focusing on both policy-based and value-function-based methods, to the traffic signal control problem in order to find optimal signalling control policies using only raw visual input data from the traffic simulator snapshots.

Our approaches have led to promising results and showed that they can find more stable control policies compared with previous work on using deep RL in traffic light optimisation.

In our work, we developed and tested the proposed methods in a small application. Extending the work to more complex traffic simulations, for instance by considering many intersections with multiple agents controlling them and by using multi-agent learning techniques to handle the coordination problem between agents, would be a direction for future research.



