Deep Reinforcement Learning for Addressing Disruptions in Traffic Light Control

Abstract: This paper investigates the use of a multi-agent deep Q-network (MADQN) to address the curse of dimensionality that occurs in the traditional multi-agent reinforcement learning (MARL) approach. The proposed MADQN is applied to traffic light controllers at multiple intersections with busy traffic and traffic disruptions, particularly rainfall. MADQN is based on the deep Q-network (DQN), which integrates the traditional reinforcement learning (RL) and the newly emerging deep learning (DL) approaches. MADQN enables traffic light controllers to learn, exchange knowledge with neighboring agents, and select optimal joint actions in a collaborative manner. A case study based on a real traffic network is conducted as part of a sustainable urban city project in the Sunway City of Kuala Lumpur in Malaysia. An investigation is also performed using a grid traffic network (GTN) to show that the proposed scheme is effective in a traditional traffic network. The proposed scheme is evaluated using two simulation tools, namely Matlab and Simulation of Urban Mobility (SUMO). The simulation results show that the proposed scheme can reduce the cumulative delay of vehicles by up to 30%.


Introduction
Traffic congestion has become a problem in most urban areas of the world, causing enormous economic waste, extra travel delay, and excessive vehicle emission [1]. Traffic light controllers are installed to monitor and control the traffic flows at intersections in order to alleviate traffic congestion strategically. Each traffic light controller has: a) light colors, in which green represents "go", amber represents "slow down", and red represents "stop"; b) a traffic phase, in which a set of green lights is assigned to a set of lanes for safe and non-conflicting movements at the intersection; and c) a traffic phase split, which is the time period of a traffic phase. The traffic phase split includes a short moment of red lights for all lanes to provide a safe transition between traffic phases. The red timing is the time that has passed since the light of the respective lane changed to red.
Next, we present five main fundamentals related to our investigation of the application of MADQN to traffic light controllers for reducing the cumulative delay of vehicles. Firstly, various types of traffic light controllers used to control and alleviate traffic congestion are presented. Secondly, various artificial intelligence approaches used to realize fully-dynamic traffic light controllers are presented. Thirdly, the traditional DQN approach is presented as an enhanced artificial intelligence approach applied to traffic light controllers. Fourthly, the significance of using the cumulative delay of vehicles as the performance measure is presented. Fifthly, the Burr distribution is introduced as a means of modeling traffic disruptions in traffic networks. The contributions of this paper are also presented.

Types of Traffic Light Controllers
Traditionally, traffic light controllers monitor traffic movements and determine traffic phase splits to control and reduce traffic congestion by using three main techniques. Firstly, a deterministic traffic light controller is a pretimed control system that uses historical traffic data collected at different times and determines traffic phase splits using the Webster formula [2]. Secondly, a semi-dynamic traffic light controller is a system based on actuated control that uses the instantaneous or short-term traffic condition, such as the absence or presence of vehicles, and assigns green lights to lanes with vehicles [3]. The short-term traffic condition can be detected using an inductive loop detector. Thirdly, a fully-dynamic traffic light controller is also a system based on actuated control; however, it uses the longer-term traffic condition, including the average queue length and waiting time of vehicles at a lane. The longer-term traffic condition can be measured by counting vehicles using at least two inductive loop detectors at a lane, with one installed near the intersection and another installed further away from the intersection [4]. From the number of vehicles, the average queue length and waiting time of vehicles at a lane can be calculated. A video-based traffic detector (or camera sensor) can also be installed at an intersection, and image processing can be used to calculate the number of vehicles at a lane of the intersection [5]. Subsequently, the controller adjusts the traffic phase split based on the traffic condition [6]. Fully-dynamic traffic light controllers are more realistic in monitoring traffic movements because they use longer-term traffic conditions, and the approach has been shown to alleviate traffic congestion more effectively compared to deterministic and semi-dynamic traffic light controllers [7-9].
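As an illustration of how a deterministic controller can derive traffic phase splits, the following minimal sketch applies Webster's commonly cited expressions: the optimal cycle length C0 = (1.5L + 5) / (1 - Y), with the effective green time shared among the phases in proportion to their critical flow ratios. The function name, the flow ratios, and the lost time are assumptions chosen for illustration only; they are not taken from this paper.

```python
def webster_splits(flow_ratios, lost_time_s):
    """Return (cycle length, green time per phase) in seconds.

    flow_ratios: critical flow ratio y_i of each phase (flow / saturation flow).
    lost_time_s: total lost time L per cycle in seconds.
    Both inputs are illustrative assumptions, not values from the paper.
    """
    total_ratio = sum(flow_ratios)  # Y, the sum of the critical flow ratios
    if not 0.0 < total_ratio < 1.0:
        raise ValueError("sum of critical flow ratios must lie in (0, 1)")
    # Webster's optimal cycle length: C0 = (1.5 L + 5) / (1 - Y)
    cycle = (1.5 * lost_time_s + 5.0) / (1.0 - total_ratio)
    # Effective green time is what remains after the lost time
    effective_green = cycle - lost_time_s
    # Each phase receives green time in proportion to its flow ratio
    greens = [effective_green * y / total_ratio for y in flow_ratios]
    return cycle, greens


# Hypothetical four-phase intersection
cycle, greens = webster_splits([0.2, 0.15, 0.25, 0.1], lost_time_s=16.0)
print(round(cycle, 1), [round(g, 1) for g in greens])
```

Because the splits depend only on historical flow ratios, such a controller cannot react to short-term changes, which is the limitation that actuated and fully-dynamic controllers address.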

Common Approaches of Fully-Dynamic Traffic Light Controllers
The fully-dynamic traffic light controller has commonly been realized using artificial intelligence approaches, particularly reinforcement learning (RL) [10,11] and multi-agent reinforcement learning (MARL) [12]. RL can be embedded in a single agent (or a decision maker), which is the traffic light controller, to learn, select the optimal action (i.e., traffic phase split), and improve its operating environment (i.e., the local traffic condition). In contrast, MARL can be embedded in multiple traffic light controllers (or agents) to exchange knowledge (i.e., Q-values), learn, select the optimal joint action, and improve their operating environment (i.e., the global traffic condition) in collaboration. However, the curse of dimensionality occurs in single-agent RL and MARL, in which the state space is too large to be handled efficiently due to the complexity of the traffic congestion issue [13], and so this paper uses the multi-agent deep Q-network (MADQN), which is based on the traditional single-agent DQN approach [14]. This paper extends our previous work [15], which mainly focuses on performance measures such as throughput, waiting time, and queue length, which do not relate directly to drivers' experience. This paper investigates the application of MADQN to traffic light controllers at intersections with a high volume of traffic and traffic disruptions (i.e., rainfall) by using the cumulative delay of vehicles as the performance measure, which relates to drivers' experience.

DQN as an Enhanced RL-Based Approach for Traffic Light Controllers
DQN is a combination of the newly evolving deep learning technique [16] and the traditional RL technique, conveniently called deep reinforcement learning (DRL) [17]. DQN addresses the curse of dimensionality and provides two main advantages [18]: a) it reduces the learning time and computational cost incurred to explore different state-action pairs and identify the optimal actions; and b) it uses hidden layers to provide abstract and continuous representations of the complex and high-dimensional inputs (i.e., the state space), which reduces the storage capacity needed to store the unlimited number of state-action pairs (or the Q-values).

Cumulative Delay of Vehicles as the Performance Measure Used by Traffic Light Controllers
The cumulative delay of vehicles is the average time (i.e., the average travelling and waiting times caused by congestion) taken by vehicles to travel from a source location to a destination location, which may require crossing multiple intersections. Compared to other measures, including the average queue length and waiting time of the vehicles at a lane, and throughput (i.e., the number of vehicles crossing an intersection), the cumulative delay can be directly perceived by drivers, and so it relates to the drivers' experiences. In other words, drivers perceive the difference between the actual and expected travel times when crossing multiple intersections. Investigations that use the cumulative delay of vehicles as the performance measure have gained momentum over the years with the application of DRL to traffic light control. Specifically, the cumulative delay is the most frequently used performance measure in the literature on the application of DRL to traffic light control from January 2016 to November 2020, as shown in Fig. 1. The scientific literature databases, including Web of Science [19], IEEE Xplore Digital Library [20], and ScienceDirect [21], have been used to conduct this study. The growing use of the cumulative delay of vehicles is due to its better reflection of real circumstances, particularly the drivers' experiences. While the cumulative delay has been used in the literature [22,23], it has not been applied to traffic congestion in multi-intersection scenarios in the presence of increased traffic volume and traffic disruptions, and so this paper adopts this measure to calculate the average time required by vehicles to cross multiple intersections.

Burr Distribution for Introducing Traffic Disruptions to Traffic Networks
Traffic congestion can be categorized into: a) recurrent congestion (RC) caused by a high volume of traffic; and b) non-recurrent congestion (NRC) caused by erratic traffic disruptions, including accidents and rainfall [24,25]. In the literature [26], the arrival process of vehicles has been widely modeled by the Poisson process, whereby the inter-arrival time of vehicles follows an exponential distribution. The Poisson process incorporates RC naturally; however, it does not incorporate NRC, and so the Burr distribution, which is used as a generalization of the Poisson process, has been adopted in this work to model the inter-arrival time of vehicles under scenarios with a high volume of traffic (i.e., RC) and traffic disruptions (i.e., NRC).
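The contrast between the two arrival models can be sketched by sampling inter-arrival times from both. The sketch below assumes the Burr Type XII form, whose cumulative distribution function is F(x) = 1 - (1 + (x/scale)^c)^(-k), and samples it by inverting F; the paper does not state which Burr type or which parameters it uses, so the type, the parameter values, and the function names here are illustrative assumptions.

```python
import random


def burr_cdf(x, c, k, scale):
    """CDF of the (assumed) Burr Type XII distribution."""
    return 1.0 - (1.0 + (x / scale) ** c) ** (-k)


def burr_quantile(u, c, k, scale):
    """Inverse CDF of Burr Type XII for u in [0, 1)."""
    return scale * ((1.0 - u) ** (-1.0 / k) - 1.0) ** (1.0 / c)


def sample_burr_gap(rng, c, k, scale):
    """Draw one inter-arrival time (seconds) by inverse-transform sampling."""
    return burr_quantile(rng.random(), c, k, scale)


def sample_exponential_gap(rng, rate):
    """Draw one inter-arrival time (seconds) of a Poisson arrival process."""
    return rng.expovariate(rate)


rng = random.Random(42)  # fixed seed so that runs are repeatable
# Hypothetical parameters: about one vehicle every 4 s on average for the
# Poisson model, and an arbitrary shape/scale choice for the Burr model.
poisson_gaps = [sample_exponential_gap(rng, rate=0.25) for _ in range(1000)]
burr_gaps = [sample_burr_gap(rng, c=2.0, k=1.5, scale=4.0) for _ in range(1000)]
print(max(poisson_gaps), max(burr_gaps))
```

The two shape parameters c and k give the Burr model extra freedom over the tail of the inter-arrival times, which is what allows occasional long gaps or bursts associated with disruptions to be represented.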

Contributions of the Paper
Our contribution is to investigate the application of MADQN to traffic light controllers at intersections with a high volume of traffic and traffic disruptions (i.e., rainfall). This work is based on simulation using the Burr distribution, which has been shown in [27] to model traffic disruptions in traffic networks accurately. The performance measure is the cumulative delay of vehicles, which includes the average waiting and travelling times caused by congestion. This performance measure is used because the difference between the actual and expected travel times when crossing multiple intersections can be directly perceived by drivers, and so it relates to the drivers' experiences. In this paper, we aim to compare the performance of MADQN and MARL applied to traffic light controllers at intersections with a high volume of traffic and traffic disruptions in terms of the cumulative delay of vehicles, which has not been investigated in the literature despite its significance.

Organization of the Paper
The paper is structured into six sections.
• Section 1 presents the introduction of traffic light controllers and the background of common approaches for traffic light controllers. It also presents the contributions of the paper.
• Section 2 presents the background of the Krauss vehicle-following model, DRL and MARL. The traditional algorithms of DRL and MARL are also presented in this section.
• Section 3 presents the literature review of DQN-based traffic light controllers. There are five main DQN approaches discussed in this section.
• Section 4 presents the proposed MADQN model for traffic light controllers. It presents the representations of the proposed MADQN model, including the state space, action space, and delayed reward, applied to traffic light controllers. It also presents the MADQN architecture and algorithm.
• Section 5 presents an application for sustainable city (i.e., Sunway city), and our simulation results and discussion.
• Section 6 concludes this paper with a discussion of potential future directions.

Background
This section presents the background of the Krauss vehicle-following model, DRL and MARL. The Krauss vehicle-following model is a mathematical model of safe vehicular movement, whereby a gap between two consecutive vehicles is maintained. The background of DRL includes the traditional single-agent deep Q-network (DQN) algorithm. The background of MARL includes its traditional algorithm.

Krauss Vehicle-Following Model
In 1997, Krauss developed a vehicle-following model based on the safe speed of vehicles. The safe speed is calculated as follows [28]:
$$u_{\mathrm{safe}}(t) = u_l(t) + \frac{g(t) - u_l(t)\,\tau_r}{\dfrac{u_l(t) + u_f(t)}{2b} + \tau_r} \qquad (1)$$
where $u_l(t)$ and $u_f(t)$ represent the speed of the leading and following vehicles at time $t$, respectively, and $g(t)$ is the gap to the leading vehicle at time $t$. The driver's reaction time (e.g., one second) is represented by $\tau_r$, and $b$ is the maximum deceleration of the vehicle.
In our study, the Krauss vehicle-following model is used to ensure the safe movement of vehicles at intersections of the Sunway city and grid traffic networks (see Section 5).
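The safe speed can be sketched as a small function, assuming the commonly cited Krauss form in which the safe speed equals the leader's speed plus a correction term, namely the gap in excess of the reaction distance divided by the sum of the mean speed over the maximum deceleration and the reaction time. The function name and the default values for the reaction time and deceleration are illustrative assumptions, not settings reported in this paper.

```python
def krauss_safe_speed(u_lead, u_follow, gap, reaction_time=1.0, max_decel=4.5):
    """Safe speed (m/s) of the following vehicle in the Krauss model.

    u_lead, u_follow: speeds of the leading and following vehicles (m/s).
    gap: distance to the leading vehicle (m).
    reaction_time: driver's reaction time tau_r (s), assumed default.
    max_decel: maximum deceleration b (m/s^2), assumed default.
    """
    mean_speed = (u_lead + u_follow) / 2.0
    # Gap that remains after the leader's travel during the reaction time
    spare_gap = gap - u_lead * reaction_time
    return u_lead + spare_gap / (mean_speed / max_decel + reaction_time)


# When the gap equals the reaction distance, the follower may match the leader
print(krauss_safe_speed(u_lead=10.0, u_follow=10.0, gap=10.0))
# A larger gap permits a higher safe speed, a smaller gap forces slowing down
print(krauss_safe_speed(u_lead=10.0, u_follow=10.0, gap=20.0))
print(krauss_safe_speed(u_lead=10.0, u_follow=10.0, gap=5.0))
```

In a simulator, the speed actually applied is typically further capped by the speed limit and the vehicle's acceleration capability; that step is omitted here.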

Deep Reinforcement Learning
DRL incorporates deep neural networks (DNNs) into RL, which enables agents to learn relationships between actions and states. DeepMind first proposed the DQN [17], which is a DRL method, and it has been widely adopted in traffic light control [29]. DQN consists of a DNN, which comprises three main kinds of layers, namely the input layer, hidden layer(s), and output layer. In DQN, the neurons are interconnected with each other and they can learn complex and unstructured data [30]. During training, the data flows from the input layer to the hidden layer(s), and finally to the output layer. DQN provides two main features: a) experience replay, in which experiences are stored in a replay memory and then randomly selected for training; and b) a target network, which is a duplicate of the main network. The main network selects actions based on observations from the operating environment and updates its weights. The target network approximates the weights of the main network to generate its Q-value. During training, the Q-value of the target network is used to calculate the loss incurred by a selected action, which has been shown to stabilize training. The target network is updated after a fixed number of iterations. The main difference between the main and target networks is that the main network is used during the observation, action selection, and training processes, while the target network is used during the training process only.
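The experience replay feature described above can be sketched as a bounded buffer with uniform random sampling. This is a minimal illustration rather than the implementation used in this paper; the class name, the capacity, and the eviction of the oldest experience when the buffer is full are assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of experiences (state, action, reward, next_state)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Return a random minibatch (without replacement) for training."""
        size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), size)

    def __len__(self):
        return len(self._buffer)


memory = ReplayMemory(capacity=3, seed=0)
for step in range(5):
    # Hypothetical integer states, actions, and rewards for illustration
    memory.store(state=step, action=step % 2, reward=-1.0, next_state=step + 1)
minibatch = memory.sample(batch_size=2)
print(len(memory), minibatch)
```

Sampling at random from the stored experiences breaks the correlation between consecutive observations, which is one reason experience replay is associated with more stable training.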

Algorithm 1: The traditional single-agent DQN algorithm
3. observe current state $s_t$
5. select action $v_t$ using Eq. (2)
6. receive delayed reward $r_{t+1}(s_{t+1})$ and next state $s_{t+1}$
7. store experience $e_t$ in replay memory $D_t$
8. sample a random minibatch of experiences $e_n$ from replay memory $D_t$
12. perform a gradient descent optimization on $(y_j - Q(s_j, v_j; \theta_j))^2$ with respect to $\theta_j$ using Eq. (5)
14. end for
15. end for
16. end for
17. end procedure
Algorithm 1 shows the DQN algorithm. In an episode $m \in M$, the current state $s_t \in S$ (or the decision making factors) is observed by an agent. At a time instant $t \in T$, the best-known (or greedy) action $v^*_t \in V$ is selected by the agent as follows:
$$v^*_t = \arg\max_{v \in V} Q_t(s_t, v; \theta_t) \qquad (2)$$
where $Q_t(s_t, v_t; \theta_t)$ is the Q-value, which indicates how appropriate the action $v_t$ is under state $s_t$, and $\theta_t$ are the parameters of the main network. After that, the agent receives the delayed reward $r_{t+1}(s_{t+1})$ and the next state $s_{t+1}$, and then it stores its experience $e_t = (s_t, v_t, r_{t+1}(s_{t+1}), s_{t+1})$ in a replay memory $D_t = (e_1, e_2, \ldots, e_t)$. After that, the agent randomly samples a minibatch of experiences $e_n$ from the replay memory $D_t$. Suppose that the target network Q-value is $Q_t(s_t, v_t; \theta^-_t)$ and the main network Q-value is $Q_t(s_t, v_t; \theta_t)$. The target Q-value is fixed for $C$ steps to stabilize the Q-values of the main network, and to reduce the loss between the Q-values of the target and main networks. To train the main network, the loss function is minimized at iteration $j$ as follows:
$$L_j(\theta_j) = \mathbb{E}_{(s_j, v_j) \sim p(\cdot)}\left[\left(y_j - Q(s_j, v_j; \theta_j)\right)^2\right] \qquad (3)$$
where $p(s_j, v_j)$ is the probability distribution over state-action pairs $(s_j, v_j)$, and $y_j$ is the target, as follows:
$$y_j = r_{j+1}(s_{j+1}) + \gamma \max_{v} Q(s_{j+1}, v; \theta_j) \qquad (4)$$
where $\gamma$ is a discount factor, in which the discounted reward $\gamma \max_{v} Q(s_{j+1}, v; \theta_j)$ represents the long-term reward estimated by the maximum Q-value at iteration $j + 1$, and the delayed reward $r_{j+1}(s_{j+1})$ represents the short-term reward. If the episode terminates at $s_{j+1}$, then $y_j = r_{j+1}(s_{j+1})$. Treating the target $y_j$ as fixed, the gradient of the loss function $\nabla_{\theta_j} L_j(\theta_j)$ is given as follows:
$$\nabla_{\theta_j} L_j(\theta_j) = -2\,\mathbb{E}_{(s_j, v_j) \sim p(\cdot)}\left[\left(y_j - Q(s_j, v_j; \theta_j)\right) \nabla_{\theta_j} Q(s_j, v_j; \theta_j)\right] \qquad (5)$$
The target Q-values $Q_t(s_t, v_t; \theta^-_t)$ of the target network are updated by replacing the weights $\theta^-_j$ of the target network with the weights $\theta_j$ of the main network in order to provide $Q_j(s_j, v^*_j; \theta^-_j) \approx Q^*(s_j, v_j; \theta_j)$ at every $C$ steps (i.e., equivalent to a number of iterations [31]).
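The target computation, the squared-error loss, and the periodic target update described above can be sketched without any deep learning library by letting lookup tables stand in for the two networks. The sketch follows the earlier description that the target network's Q-value is used to calculate the loss; the function names, the explicit `done` flag for terminal experiences, and the numerical values are assumptions made for illustration.

```python
import copy


def td_targets(batch, q_target, gamma):
    """Compute y_j for each experience (s, v, r, s_next, done).

    y_j = r                                   if the episode ends at s_next
    y_j = r + gamma * max_v Q_target(s_next)  otherwise
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(q_target(next_state)))
    return targets


def mean_squared_loss(batch, targets, q_main):
    """Average of (y_j - Q_main(s_j)[v_j])^2 over the minibatch."""
    errors = [
        (y - q_main(state)[action]) ** 2
        for (state, action, _r, _s, _d), y in zip(batch, targets)
    ]
    return sum(errors) / len(errors)


def maybe_sync_target(step, sync_every, main_weights, target_weights):
    """Copy the main weights into the target network every `sync_every` steps."""
    if step % sync_every == 0:
        return copy.deepcopy(main_weights)
    return target_weights


# Lookup tables stand in for the main and target networks (assumption)
q_table = {"s0": [1.0, 2.0], "s1": [0.5, 3.0]}


def q_values(state):
    return q_table[state]


batch = [("s0", 1, 1.0, "s1", False), ("s1", 0, -1.0, "s0", True)]
targets = td_targets(batch, q_values, gamma=0.9)
loss = mean_squared_loss(batch, targets, q_values)
print(targets, loss)
```

With a real DNN, the loss would be passed to an optimizer that performs the gradient descent step on the main network weights, while the target weights stay frozen between synchronizations.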

Multi-Agent Reinforcement Learning
MARL extends the traditional RL approach by enabling multiple agents to exchange information with each other in order to achieve the optimal network-wide performance [32]. The main purpose is to optimize a network-wide objective function, such as the global Q-value that sums up the local Q-values of all agents in a single network, as time goes by $t = 1, 2, 3, \ldots$. Algorithm 2 shows the MARL algorithm. At time instant $t \in T$, an agent $i$ observes its current local state $s^i_t$. The weight $n_{i,j}$ represents the importance of an agent $j$ in the neighborhood of agent $i$, and $\sum_{j \in J_i} n_{i,j} = 1$.

Literature Review
This section presents a literature review of DQN-based traffic light controllers, which have been shown to improve various performance measures. Five main DQN approaches have been proposed to reduce the cumulative delay of vehicles. In general, the DQN approaches are embedded in traffic light controllers. The DQN model has an input layer that receives the state, and an output layer that provides Q-values for possible actions (e.g., traffic phases [33] and traffic phase splits [34]).

Traditional DNN-Based DQN
The application of the traditional DNN-based DQN approach to traffic light control is proposed in [33,35]. In Wan's DNN-based DQN model [33]: a) state represents the current traffic phase, the queue length, and the green and red timings; b) action represents a traffic phase; and c) reward represents the waiting time of vehicles. In Tan's DNN-based DQN model [35]: a) state represents the queue length of vehicles; b) action represents a traffic phase; and c) reward represents the queue length and waiting time, as well as throughput. The proposed schemes have been shown to reduce the cumulative delay [33,35] and queue length [35] of vehicles, and improve throughput [33].

CNN-Based DQN
The convolutional neural network (CNN)-based DQN approach enables agents to analyze visual imagery of traffic in an efficient manner. The agents process the input states, which are represented in the form of a two-dimensional matrix (i.e., multiple rows and columns of values, such as an image) or one-dimensional vectors (e.g., a single row or column of values, such as the queue length of a lane) [29]. The application of the CNN-based DQN approach to traffic light control is proposed in [22,23,34,36]. In Genders' and Gao's CNN-based DQN models [22,34]: a) state represents the current traffic phase, as well as the position and speed of vehicles; b) action represents a traffic phase [22] and a traffic phase split [34]; and c) reward represents the waiting time of vehicles. In Wei's CNN-based DQN model [36]: a) state represents the current traffic phase, as well as the position and queue length of vehicles; b) action represents a traffic phase; and c) reward represents the waiting time and queue length. In Mousavi's CNN-based DQN model [23]: a) state represents the current traffic phase and queue length; b) action represents a traffic phase; and c) reward represents the waiting time of vehicles. The proposed schemes have been shown to reduce the cumulative delay [22,23,34,36], waiting time [34], and queue length [22,23,36] of vehicles, as well as improve throughput [22,36].

SAE-Based DQN
The stacked auto encoder (SAE) neural network-based DQN approach enables agents to perform encoding and decoding functions, and store inputs efficiently. The application of the SAE neural network-based DQN approach to traffic light control is proposed in [37]. In Li's SAE neural network-based DQN model [37]: a) state represents the queue length; b) action represents a traffic phase split; and c) reward represents the queue length and waiting time. The simulation results have shown that the proposed scheme can reduce the cumulative delay and queue length of vehicles.

LSTM-Based DQN with A2C
The long short-term memory (LSTM) neural network-based DQN approach enables agents to memorize previous inputs of a traffic light control using a memory cell that maintains a time window of states. The advantage of the actor-critic (A2C)-based method is that it combines the value-based and policy-gradient (PG)-based methods. Each agent has an actor that controls how it behaves (i.e., PG-based), and a critic that measures the suitability of the selected action (i.e., value-based). The application of the LSTM neural network-based DQN with A2C approach to traffic light control has been proposed in [38]. In [38]: a) state represents the queue length; b) action represents a traffic phase; and c) reward represents the queue length and waiting time. The simulation results have shown that the proposed scheme can reduce the cumulative delay and queue length of vehicles, as well as improve throughput.

MADQN
The MADQN approach allows multiple DQN agents to share knowledge (i.e., Q-values), learn, and select optimal joint actions (i.e., traffic phase splits) in collaboration. In Rasheed's MADQN model [15]: a) state represents the queue length, the current traffic phase, the red timing, and the rainfall intensity; b) action represents a traffic phase split; and c) reward represents the waiting time of the vehicles. The simulation results have shown that the proposed scheme can reduce the waiting time and queue length at a lane, and improve throughput. In this paper, we extend the work in [15] by evaluating the cumulative delay of vehicles incurred by MARL and MADQN at multiple intersections in the presence of traffic disruptions (i.e., rainfall).

Our Proposed MADQN Approach for Traffic Light Controllers
In a traffic network, a set of intersections $I$ is considered in this paper, whereby $i \in I$ is an intersection in which: a) $K_i$ is the set of incoming lanes, and b) $J_i$ is the set of neighboring intersections. Fig. 2 shows an abstract model of MADQN, in which the agent $i$ and its neighboring agents $j = 1 \in J_i$ and $j = 2 \in J_i$ share the same traffic environment. This research uses four traffic phases: a) the north-east bound traffic phase; b) the east-south bound traffic phase; c) the west-north bound traffic phase; and d) the south-west bound traffic phase. The traffic phases are activated in a round-robin fashion by traffic light controllers at intersections, and our MADQN approach is used to adjust the time intervals of the traffic phases (i.e., traffic phase splits). Our proposed MADQN approach, including the MADQN model (i.e., the state, action, and delayed reward representations), the MADQN architecture, and the MADQN algorithm with its complexity analysis, is presented in the remainder of this section.

MADQN Model
MADQN has three main advantages as compared to MARL:
• MADQN uses DNNs, which provide the state space with its continuous representation. Consequently, it represents an unlimited number of pairs of state-action.
• MADQN addresses the curse of dimensionality by providing efficient storage for complex inputs.
• MADQN uses a target network with experience replay, and so it improves the stability of training.
The remainder of this subsection presents the representations of the state, action, and delayed reward of the MADQN model at an intersection i at time t.

State
The state $s^i_t = (s^i_{1,t}, s^i_{2,k,t}, s^i_{3,k,t}, s^i_{4,t}, s^i_{5,t}) \in S$ represents the decision making factors as follows:
• $s^i_{1,t} \in \{0, 1, 2, 3\}$ represents the current traffic phase, and it is a discrete state. The north-east bound traffic phase is represented by a 0 value, the east-south is represented by 1, the west-north is represented by 2, and the south-west is represented by 3.
• $s^i_{2,k,t} \in \{0, 1, 2, 3\}, \forall k \in K_i$ represents the queue length of the incoming lanes $K_i$, and it is a continuous state. No vehicle at a lane is represented by a 0 value, $\le 25\%$ occupancy is represented by 1, $> 25\%$ and $\le 50\%$ is represented by 2, and $> 50\%$ is represented by 3. The occupancy can be measured using inductive loop detectors installed at intersections.
• $s^i_{3,k,t} \in \{0, 1, 2, 3\}, \forall k \in K_j$ represents the queue length of the incoming lanes $k \in K_j$ at a neighboring intersection $j$, and it is a continuous state. Both $s^i_{2,k,t}$ and $s^i_{3,k,t}$ have a similar representation.
• $s^i_{4,t} \in t^{i,k}_{\mathrm{red},t}$ represents the red timing of the current traffic phase, and it is a continuous state.
• $s^i_{5,t} \in \{0, 1, 2, 3, \ldots, s^i_5\}$ represents the intensity of rainfall, in which a 0 value means no rain and the maximum value $s^i_5$ means the heaviest rain, and it is a continuous state. Simply, the intensity of the disruption is represented by the sub-state $s^i_{5,t}$.
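The state representation above can be sketched as a small encoder that maps lane occupancy to the four queue-length levels and assembles the five sub-states into one tuple. The function names and the example readings are illustrative assumptions; only the thresholds (no vehicle, up to 25%, up to 50%, above 50%) come from the description above.

```python
def queue_level(occupancy):
    """Map lane occupancy in [0, 1] to the queue-length levels 0..3."""
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be a fraction between 0 and 1")
    if occupancy == 0.0:
        return 0  # no vehicle at the lane
    if occupancy <= 0.25:
        return 1  # up to 25% occupancy
    if occupancy <= 0.50:
        return 2  # above 25% and up to 50% occupancy
    return 3      # above 50% occupancy


def build_state(phase, own_occupancies, neighbor_occupancies,
                red_time_s, rain_level):
    """Assemble the sub-states (s1, s2, s3, s4, s5) of one agent."""
    return (
        phase,                                               # s1: current phase
        tuple(queue_level(o) for o in own_occupancies),      # s2: own lanes
        tuple(queue_level(o) for o in neighbor_occupancies), # s3: neighbor lanes
        red_time_s,                                          # s4: red timing
        rain_level,                                          # s5: rainfall intensity
    )


# Hypothetical detector readings for one intersection
state = build_state(phase=0, own_occupancies=[0.0, 0.1],
                    neighbor_occupancies=[0.6], red_time_s=12.0, rain_level=2)
print(state)
```

In practice the tuple would be flattened into a numeric vector before being fed to the input layer of the DNN.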

Action
The action $v^i_t \in V^i$ represents a selected action, which is a traffic phase split $v^i_t \in \{0, 1, 2, 3, 4\}$ in a fixed predetermined round-robin sequence of traffic phases, where $v^i_t = 0$ skips a traffic phase (for instance, due to the absence of a waiting vehicle at a lane). The north-east bound traffic phase is represented by a 1 value, the east-south is represented by 2, the west-north is represented by 3, and the south-west is represented by 4. Hence, agent $i$ can select to switch to another traffic phase or to keep the current traffic phase.

Delayed Reward
An agent receives delayed rewards that vary with the average waiting time of the vehicles at the intersections. Traffic congestion can cause an increment in the average waiting time of the vehicles at the intersections. The delayed reward $r^i_{t+1}(s^i_{t+1}) = W^i_t - W^i_{t+1}$ is a relative value that represents the difference between the average total waiting time of all vehicles at an intersection $i$ at time $t$ and $t + 1$ (i.e., before and after taking an action $v^i_t$), whereby $W^i_t > W^i_{t+1}$ gives a positive delayed reward, $W^i_t = W^i_{t+1}$ gives a zero delayed reward, and $W^i_t < W^i_{t+1}$ gives a negative delayed reward.

MADQN Architecture
Fig. 3 shows the DQN architecture. There are three main components in an agent, namely the main network, the target network, and the replay memory. The main network consists of a DNN with its weight $\theta^i_t$ used to approximate its Q-values $Q^i_t(s^i_t, v^i_t; \theta^i_t)$. The main network is used to choose an action $v^i_t$ for a particular state $s^i_t$ observed from the operating environment in order to achieve the best possible delayed reward $r^i_{t+1}(s^i_{t+1})$ and next state $s^i_{t+1}$ at the next time instant $t + 1$. The target network is a copy (or duplicate) of the main network with its weight $\theta^{i-}_t$ used to approximate its Q-values $Q^i_t(s^i_t, v^i_t; \theta^{i-}_t)$. The target network is used during training only, and the main network is used during both action selection and training. The replay memory represents the dataset of an agent's experiences $D^i_t = (e^i_1, e^i_2, \ldots, e^i_t)$.
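The delayed reward that the main network seeks to maximize, namely the difference between the average waiting time before and after an action, can be sketched as follows. The function names and the per-vehicle waiting times are illustrative assumptions; in the simulations these values would come from the traffic simulator.

```python
def average_waiting_time(waiting_times_s):
    """Average waiting time (s) of the vehicles at an intersection."""
    if not waiting_times_s:
        return 0.0  # an empty intersection incurs no waiting time
    return sum(waiting_times_s) / len(waiting_times_s)


def delayed_reward(waits_before, waits_after):
    """Reward W_t - W_{t+1}: positive when the action reduced waiting."""
    return average_waiting_time(waits_before) - average_waiting_time(waits_after)


# Hypothetical per-vehicle waiting times before and after one action
print(delayed_reward([30.0, 50.0], [20.0, 40.0]))  # waiting fell: positive
print(delayed_reward([30.0, 50.0], [30.0, 50.0]))  # unchanged: zero
print(delayed_reward([30.0, 50.0], [45.0, 55.0]))  # waiting rose: negative
```

Because the reward is relative, an agent is credited for reducing congestion even when the absolute waiting time remains high, for example during heavy rainfall.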

MADQN Algorithm
In this section, the extension of the traditional DQN approach to MADQN for multiple intersections is presented, and it has not been explored in the literature. The proposed MADQN algorithm is evaluated in simulation under different traffic networks (see Section 5) in the presence of traffic disruptions.
MADQN allows knowledge to be learned and exchanged among multiple DQN agents for coordination. The traditional MARL approach enables an agent to choose an optimal action based on its neighboring agents' actions. In the moving target scenario, actions are selected independently and simultaneously by the agents, and so the action selected by an agent $i$ can affect the operating environment of its neighboring agents $J_i$. Therefore, the moving target scenario increases the dynamicity of the operating environment, which affects learning stability. For instance, at an upstream intersection, the action of a traffic light controller $i$ can have positive or negative effects on the congestion level of downstream and neighboring intersections $J_i$ since vehicles move from one intersection to another. Likewise, the action of agent $i$ at an intersection can be affected by the actions of agents $J_i$ at neighboring intersections. By exchanging knowledge and coordinating the agents, convergence to an optimal action in a multi-agent system has been shown in the literature [39]. The summation of the local Q-values of the agents is known as the global Q-value, which represents the global objective function. An optimal equilibrium is attained when the global Q-value converges. The convergence is attributed to: a) an agent updating its Q-values by using Q-values from neighboring agents $J_i$; b) the availability of a local view of neighboring agents $J_i$ at an agent $i$; and c) an agent's action being the best response to its neighboring agents. MADQN addresses the moving target by taking neighboring agents' actions into consideration, and by coordinating among the agents in a collaborative manner, in order to converge to an optimal joint action and achieve stability in a shared environment.
Algorithm 3: The proposed MADQN algorithm
3. observe current state $s^i_t \in S$
4-5. {Knowledge sharing process}
7. select action $v^i_t$ using Eq. (9)
8. receive delayed reward $r^i_{t+1}(s^i_{t+1})$ and next state $s^i_{t+1}$
9. store experience $e^i_t$
compute the loss function
perform a gradient descent optimization on the loss function
Message complexity is the number of messages exchanged between the agents in order to calculate the Q-values. An agent $i$ exchanges its knowledge (i.e., Q-values) with its neighboring agents $J$ (Steps 4-5), so the step-wise message complexity is $\le |J|$ since there are $|J|$ neighboring agents. The agent-wise message complexity is likewise $\le |J|$.
Storage complexity is the amount of memory needed to store knowledge (i.e., Q-values) and the experiences of agents. An agent $i$ stores its experience $e^i_t = (s^i_t, v^i_t, r^i_{t+1}(s^i_{t+1}), s^i_{t+1})$ (Step 9), so the step-wise storage complexity has a value of 1, and the agent-wise complexity is calculated as $\le |S||A|$.

Application for Sustainable City and Simulation Results
An investigation of the proposed scheme has been conducted in a case study based on a real traffic network, which is part of a sustainable urban city project in the Sunway City of Kuala Lumpur in Malaysia. An investigation is also performed using a grid traffic network (GTN) to understand the performance of the proposed scheme in a complex traffic network. Hence, our simulation-based investigation covers both real-world and complex traffic networks. In this paper, traffic networks with left-hand traffic are considered, in which the traffic movement for the left turn is either protected or does not conflict with other traffic movements. This section also presents the simulation results and discussion for both RC and NRC environments.

Sunway City in Kuala Lumpur
Sunway city is one of the sustainable and smart cities in Malaysia [40]. It has busy commercial areas, residential areas with high density (i.e., LaCosta and Sunway Monash Residence), higher educational institutions (i.e., Monash University Malaysia campus and Sunway University), health centre (i.e., Sunway Medical Centre), amusement park (i.e., Sunway Lagoon), hotel (i.e., Sunway Resort Hotel & Spa), and so on, as shown in Fig. 4. In Fig. 4, the Sunway city traffic network (SCTN) has seven intersections, whereby every intersection has a traffic light controller. Fig. 5 shows the traffic phases, and Tab. 1 shows the traffic phase splits of existing (i.e., deterministic) traffic light controllers at all intersections in SCTN. The traffic phase splits were observed during the evening busy hours (i.e., 5-7 pm) of a working day, and they were measured using a stopwatch.
Malaysia ranks third and fifth worldwide in the number of lightning strikes (i.e., around 240 thunderstorm days/year [41]) and rainfall (i.e., around 1000 mm/year [42]), respectively. So, traffic congestion caused by traffic disruptions (i.e., rainfall) during the peak hours is a serious problem.
In this paper, we apply our proposed algorithm to the traffic network of Sunway city. Investigation is conducted in the traffic simulator SUMO [43].

Grid Traffic Network
A GTN, which is a complex traffic network, has been widely adopted in the literature [15,44-47] to conduct similar investigations, and so it is selected for investigation in this paper to show that the proposed scheme is effective. This paper uses a 3 × 3 GTN with nine intersections, whereby a traffic light controller is installed at each intersection, which has four legs in four different directions (i.e., north bound, south bound, east bound, and west bound). Each leg has two lanes so that a vehicle can either enter or leave the leg of an intersection.
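As a minimal sketch of the 3 × 3 layout described above, the nine intersections can be indexed by (row, column) and linked to their adjacent intersections. The function name `grid_neighbors` and the indexing scheme are illustrative assumptions, not part of the original scheme. The sketch also assumes that the legs of boundary intersections that do not lead to another intersection serve as entry and exit points of the network.

```python
from typing import Dict, List, Tuple

Intersection = Tuple[int, int]  # (row, col) index; a hypothetical naming scheme


def grid_neighbors(rows: int = 3, cols: int = 3) -> Dict[Intersection, List[Intersection]]:
    """Map every intersection of a rows x cols grid to its adjacent intersections."""
    neighbors: Dict[Intersection, List[Intersection]] = {}
    for r in range(rows):
        for c in range(cols):
            # Candidate neighbors on the north, south, west, and east legs.
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only candidates that lie inside the grid.
            neighbors[(r, c)] = [
                (nr, nc) for nr, nc in candidates
                if 0 <= nr < rows and 0 <= nc < cols
            ]
    return neighbors


gtn = grid_neighbors()
for node, adjacent in sorted(gtn.items()):
    print(node, "->", adjacent)
```

In this sketch only the centre intersection has four neighboring controllers, while corner intersections have two and edge intersections have three. Such an adjacency map is one way to decide which agents exchange knowledge with each other.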

Simulation Settings
This subsection specifies the simulation setup. Two traffic networks are investigated: a) SCTN with seven intersections (see Fig. 4), and b) a 3 × 3 GTN with nine intersections. While SCTN is based on a real-world traffic network, GTN is a complex traffic network traditionally used in traffic light control investigations [15,44-47]. These two networks are therefore chosen to investigate the effectiveness of the proposed scheme in both real-world and complex traffic networks. The simulations are conducted using Matlab [48] and the traffic simulator SUMO [43]. The traffic control interface protocol of SUMO (i.e., TraCI4Matlab [49]) is used to interconnect Matlab and SUMO. Both SCTN and GTN are designed using NetEdit, which is the traffic network editor of SUMO. The resource files in XML provide the speed limits and arrival rates of vehicles, which define the RC and NRC traffic congestion levels and their effects on the traffic networks. Each simulation runs for up to 100 episodes. The steps of each episode are given in Steps 2 to 18 of Algorithm 3.

Parameters of Simulation and Performance Measure
The simulation parameters that allow the best possible results for a DQN agent are presented in Tab. 2. Up to 50,000 experiences can be stored in a replay memory, and up to 100 experiences can be sampled randomly to form a minibatch. The parameter values presented in Tab. 2 have been shown to provide the best possible performance in the literature [44].
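The replay memory settings above can be illustrated with a small sketch. It assumes a first-in-first-out memory that evicts the oldest experience once the capacity of 50,000 is reached, and uniform random sampling of up to 100 experiences per minibatch. The class name `ReplayMemory`, its methods, and the eviction policy are assumptions for illustration and are not taken from the original implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of experiences (state, action, reward, next_state)."""

    def __init__(self, capacity=50_000, seed=None):
        # A deque with maxlen drops the oldest item when full (assumed FIFO policy).
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=100):
        # Sample without replacement; return fewer items if the memory is not yet full enough.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


# Usage with placeholder experiences (the values carry no traffic meaning).
memory = ReplayMemory()
for t in range(250):
    memory.store(state=t, action=t % 2, reward=-1.0, next_state=t + 1)
minibatch = memory.sample()
print(len(memory), len(minibatch))
```

Sampling at random from a large memory breaks the correlation between consecutive experiences, which is the usual motivation for experience replay in DQN.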
Tab. 3 presents the simulation parameters of the Burr type XII distribution model for various intensities of rainfall, including no rain (NR), light rain (LR), moderate rain (MR), and heavy rain (HR) scenarios [27]. A lower value of the scale parameter β shrinks the distribution, and a higher value stretches it. The shape parameters k and c are reciprocals of the scale parameter β. The shape parameters k and c, as well as the scale parameter β, increase with the intensity of rainfall [27]. The performance measure used in this paper is the cumulative delay of the vehicles. Our proposed scheme aims to reduce the cumulative delay incurred by vehicles in crossing multiple intersections. The cumulative delay also includes the average travelling and waiting times during congestion caused by RC and NRC. The total number of vehicles is 1000 per episode.
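To make the role of the scale parameter concrete, the following sketch draws samples from a Burr type XII distribution by inverse transform sampling, using the standard cumulative distribution function $F(x) = 1 - (1 + (x/\beta)^c)^{-k}$ for $x > 0$. The function names and the parameter values are hypothetical and are not the values of Tab. 3. The quantity being modelled is left generic and is simply called `x`.

```python
import random


def burr12_cdf(x, c, k, beta):
    """Burr type XII CDF with shape parameters c, k and scale parameter beta."""
    if x <= 0:
        return 0.0
    return 1.0 - (1.0 + (x / beta) ** c) ** (-k)


def burr12_quantile(u, c, k, beta):
    """Inverse of the CDF for a probability u in [0, 1)."""
    return beta * ((1.0 - u) ** (-1.0 / k) - 1.0) ** (1.0 / c)


def burr12_sample(c, k, beta, rng):
    """Draw one sample by passing a uniform random number through the quantile function."""
    return burr12_quantile(rng.random(), c, k, beta)


# Hypothetical parameters for illustration only (not taken from Tab. 3).
rng = random.Random(42)
samples = [burr12_sample(c=2.0, k=1.5, beta=4.0, rng=rng) for _ in range(5)]
print(samples)

# Doubling the scale parameter doubles every quantile, i.e., it stretches the distribution.
median_small = burr12_quantile(0.5, c=2.0, k=1.5, beta=4.0)
median_large = burr12_quantile(0.5, c=2.0, k=1.5, beta=8.0)
print(median_small, median_large)
```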

Results and Discussion
This section compares the performance measures achieved by our proposed MADQN, MARL, and the baseline approaches under RC and NRC traffic congestions.
[Fig. 6]
The accumulated delayed reward for MARL and MADQN under RC and NRC traffic congestions increases with the number of episodes in both SCTN and GTN, as shown in Fig. 7. The accumulated delayed reward for both MARL and MADQN becomes steady after 50 episodes. Compared to MARL, the accumulated delayed reward achieved by MADQN is higher in both types of traffic congestions (i.e., RC and NRC) and both traffic networks (i.e., SCTN and GTN). Overall, MADQN increases the accumulated delayed reward by up to 10% and 12.5% under RC and NRC traffic congestions in the SCTN, and by up to 8.3% and 7.2% under RC and NRC traffic congestions in the GTN, respectively.

Conclusion
This paper investigates the application of multi-agent deep Q-network (MADQN) to traffic light controllers at multiple intersections in order to address two types of traffic congestions: a) recurrent congestion (RC) caused by a high volume of traffic; and b) non-recurrent congestion (NRC) caused by traffic disruptions, particularly bad weather conditions. From the traffic light controller perspective, MADQN adjusts the traffic phase split according to traffic demand in order to minimize the number of waiting vehicles at different lanes of an intersection. From the MADQN perspective, it enables traffic light controllers to use deep neural networks (DNNs) to store and represent complex and continuous states, exchange knowledge (or Q-values), learn, and achieve optimal joint actions while addressing the curse of dimensionality in a multi-agent environment. MADQN has two main features, namely target network and experience replay, which stabilize training in the presence of multiple traffic light controllers. MADQN is investigated in a traditional GTN and a real traffic network based on the Sunway city.
Our simulation in Matlab and SUMO shows that MADQN outperforms MARL by reducing the cumulative delay of vehicles by up to 27.7% and 27.9% under RC and NRC traffic congestions in the SCTN, and up to 28.5% and 27.8% under RC and NRC traffic congestions in the GTN, respectively.
Six directions for future work could be pursued to improve MADQN. Firstly, the assumption that the left-turning (or right-turning) traffic movement is either protected or does not conflict with other traffic movements in a left-hand (or right-hand) traffic network can be relaxed, so that unprotected or conflicting turning movements are also handled. Secondly, the experiences can be prioritized during experience replay for faster learning in a multi-agent environment with multiple intersections. Thirdly, the effects of dynamicity on MADQN, including the dynamic movement of vehicles, can be addressed. Fourthly, fairness and prioritized access among traffic flows at intersections can be provided. Fifthly, other kinds of traffic disruptions, including crashes, could be incorporated into the state space as they tend to cause serious traffic congestion. Lastly, real field experiments can be conducted to train and validate the proposed scheme using real-world feedback. Real field data can be collected so that the traffic network and the system performance achieved in the simulation can be calibrated.
Funding Statement: This research was supported by Publication Fund under Research Creativity and Management Office, Universiti Sains Malaysia.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.