Reinforcement Learning-Based Optimization for Drone Mobility in 5G and Beyond Ultra-Dense Networks

Drone applications in 5th generation (5G) networks mainly focus on services and use cases such as providing connectivity during crowded events, human-instigated disasters, unmanned aerial vehicle traffic management, internet of things in the sky, and situation awareness. 4G and 5G cellular networks face various challenges to ensure dynamic control and safe mobility of the drone when it is tasked with delivering these services. The drone can fly in three-dimensional space. The drone connectivity can suffer from increased handover cost due to several reasons, including variations in the received signal strength indicator, co-channel interference offered to the drone by neighboring cells, and abrupt drop in lobe edge signals due to antenna nulls. The baseline greedy handover algorithm only ensures the strongest connection between the drone and small cells so that the drone may experience several handovers. Intended for fast environment learning, machine learning techniques such as Q-learning help the drone fly with minimum handover cost along with robust connectivity. In this study, we propose a Q-learning-based approach evaluated in three different scenarios. The handover decision is optimized gradually using Q-learning to provide efficient mobility support with high data rate in time-sensitive applications, tactile internet, and haptics communication. Simulation results demonstrate that the proposed algorithm can effectively minimize the handover cost in a learning environment. This work presents a notable contribution to determine the optimal route of drones for researchers who are exploring UAV use cases in cellular networks where a large testing site comprised of several cells with multiple UAVs is under consideration.


Introduction
In 5th generation (5G) wireless networks, drone technology has a significant impact due to its wide range of applications. Large companies and entrepreneurs worldwide carry out numerous tasks using drones, and thus the popularity of these flying mini robots is increasing rapidly [1]. Drones play roles in education, defense, healthcare, disaster relief, surveillance, telecommunications, space, journalism, food services, and emergency response applications [2,3]. Since the deployment of 5G, drone applications have gradually increased with the reshaping of use cases and technology. Over the last decade, the growth of unmanned aerial vehicles (UAVs) has been remarkable, and low-altitude commercial drone endeavors have led to considerable air traffic. With this increased air traffic, safe flight operation while maintaining reliable connectivity to the network has become the most critical issue in drone mobility faced by cellular operators [4,5]. For safe and secure flight, some critical use cases, e.g., medical operations and emergency response, require end-to-end latency on the order of milliseconds. Moreover, cyber confrontations, confidentiality and privacy concerns, and public protection are also leading challenges. Time-sensitive drone applications require seamless connectivity via cellular infrastructure with ultra-reliable low-latency communications [6].
5G empowers a new era of the internet of everything. Users will benefit from high data rates together with ultra-reliable low-latency communications (URLLC) services. The enhanced mobile broadband (eMBB) services will enable high-speed internet connectivity for several use cases, such as public transportation, large-scale events, and smart offices. In contrast, low-power, wide-area technologies include narrow-band internet of everything for massive machine-type communications (mMTC) [7,8]. Cellular technologies such as the 5G new radio spectrum provide higher data throughput, and ultra-dense networks offer additional capacity through offloading. Higher-order modulation and coding schemes, together with millimeter-wave operation in 5G new radio, can enable data rates beyond 10 Gbps. Moreover, massive multiple-input multiple-output with beam-steering offers energy efficiency and user tracking services. The ultra-lean design of the 5G new radio architecture promises to reduce energy consumption and interference by combining multiple subchannels within a single channel. Furthermore, the transmit power of the base station (BS) can be focused in a particular direction to increase the coverage range of the cell [9]. 5G also enables traffic management of unmanned aircraft at a commercial level. New drone applications will be supported in beyond-visual-line-of-sight flights, where low-altitude operations such as those below 400 ft/120 m are allowed worldwide. Sensor data will be transmitted for live broadcasting [10]. Although 5G NR ensures seamless and ubiquitous connectivity for low-mobility users, there remain some key challenges to be addressed for high-mobility users, particularly for UAV-based communication. A UAV may carry payloads of up to hundreds of kilograms depending on weight, route interval, and battery capacity.
Moreover, in contrast to terrestrial vehicles, UAVs suffer from several key limitations such as inadequate communication links, limited energy sources, variation in the network topology, and the Doppler effect due to high air mobility. Machine learning (ML) techniques are anticipated to deliver improved solutions for network performance, channel modeling, resource management, positioning, interference from terrestrial nodes, and path loss in drone handover, all with minimum computation. ML algorithms have been proposed as key enablers for decision making in UAV-based communications, e.g., in the UAV swarm scenario, where numerous drones share network resources in an optimal manner [11].
These technologies support many UAV services, and as a result, drones will be supported at mobility speeds above 500 km/h with a maximum latency of 5 ms. There are also some mobility challenges in modern technologies with respect to drones in a 5G cellular network [12]. First, reference signals such as the received signal strength indicator (RSSI) fluctuate rapidly because the drone flies within a 3D space at high speed. Therefore, the RSSI will rise and fall suddenly, and the drone will face situations where the handover decision is challenging. Second, strong constructive and destructive interference on uplink (UL) and downlink (DL) channels from neighbor cells, caused by the drone's line-of-sight propagation conditions, also results in handover. Third, the main lobe of BS antennas covers a large portion of the cell with height limitations due to their tilt settings, thus focusing only on ground users or users inside buildings. This height limitation leaves only the side lobes of BS antennas available for UAV connectivity. Since the drone frequently flies along the side lobes of BS antennas, the signals at the antenna lobe edges may drop abruptly, which causes handover [13]. The drone might also be served by the strongest signal from a faraway BS antenna rather than by the closest BS, whose signal may be significantly weaker. Since the side lobes of the antenna have limited gain, they can only ensure connectivity over a small area when compared to the main lobe of the BS. This results in unnecessary handover from the main lobe of the serving BS to the side lobe of a neighboring BS, and hence the drone may lose connectivity during flight [14]. Thus, frequent handover occurs due to the split coverage area provided by BS side lobes and the need to maintain the best reference signal received power (RSRP) value [15,16]. As a consequence, unnecessary ping-pong handovers lead to high signaling costs, loss of connectivity, and radio link failure.
Hence, ultra-reliable low latency communication between the drone and BS requires an efficient handover mechanism for drone mobility management in the ultra-dense network.
The remainder of this article is organized as follows. Section 2 presents the related work. The problem statement, motivation, and the proposed solution are presented in Section 3. In Section 4, the handover scenario under consideration for UAV mobility along with the reinforcement learning-based solutions to optimize the mobility are discussed. In Section 5, we show a Q-learning-based optimized handover scheme. In Section 6, experimental results and a discussion are presented, and we conclude the paper in Section 7.

Related Work
The 3rd generation partnership project (3GPP) specifications of Release-15 and the 5G Infrastructure Public Private Partnership (5G PPP) reports (D1.2-5, D2.1) for drone-based vertical applications in the ultra-dense 5G network were studied in Refs. [17,18]. These studies found that drone mobility is one of the main concerns in existing 5G networks to provide reliable connectivity in ultra-dense scenarios. The simulation results in [19] were acquired under different UAV environments and with UAVs flying at different heights and velocities. The results showed that a UAV may have depreciated communications performance compared to ground UEs since they fly on lobe edges; furthermore, DL interference may cause low throughput and low SINR. In [20], the authors discussed the solutions to mitigate the interference in both the uplink and downlink to maintain an optimal performance. Additionally, mobility scenarios were considered to ensure that the UAVs remained connected to the serving BS despite the increase in altitude and situations where the neighbor BS transmitted signals at full power. In [21], the authors discussed the impact of change in cell association on the SINR of the UAV flying at different altitudes. They also compared performance gains for the 3D beamforming technique and fixed array pattern in existing LTE through simulations. The trajectory design, millimeter-wave cellular-connected UAVs, and cellular-connected UAV swarm are still open issues to accomplish high data rate and ultra-reliable low latency for robust connectivity in 5G drone use cases.
In [22], the authors discussed challenges such as low power, high reliability, and low latency connectivity for mission-critical machine-type communication (mc-MTC). To meet these mc-MTC applications' requirements, the authors considered drone-assisted and device-to-device (D2D) links as alternative forms of connectivity and achieved 40 percent link availability and reliability. In D2D and drone-assisted links, the handover ratio is maximized; however, reliability is still an open issue when dealing with > 500 km/h mobility. Simulation results in [23] showed successful handover in a 4D short and flyable trajectory using multiple-input-multiple-output ultrawideband antenna that considers kinematic and dynamic constraints. However, the authors did not address the issue of accuracy in following the 4D planned trajectory with the drone at low altitude, resulting in several unnecessary handovers while completing a trajectory. In [24], the authors proposed an analytical model using stochastic geometry to illustrate the cell association possibilities between the UAV and BS by considering the handover rates in a 3D n-tier downlink network. However, the proposed model did not result in cost-efficient handover in drone flights with a constant altitude. In [25], the authors proposed interference mitigation techniques for uplinks and downlinks to ensure that the UAV remains in LTE coverage despite increased altitude or worst-case situations in which the neighbor BS transmits a signal at full power. With their proposed technique, a strong target cell for handover could be identified to maintain drone connectivity, but the unneeded handover increased the handover cost.
In [26], the authors proposed UAV- and network-based solutions for UAV performance. The authors considered coverage probability, achievable throughput, and area spectral efficiency as performance metrics. They concluded that as the UAV's altitude rises, the coverage and performance decline; accordingly, drone antenna tilting/configuration can increase the drone's coverage and throughput. For reliable connectivity, the proposed solution enhances the coverage area, channel capacity, and spectral efficiency. However, this solution is not cost-efficient, as a fast-moving drone may face a number of handovers wherever it meets the strongest BS signal. Meanwhile, in [27], the authors proposed a framework to support URLLC in UAV communication systems, and a modified distributed antenna system was introduced. The link range between the UAV and BS was increased by optimizing the altitude of the UAV and the antenna configuration; additionally, increasing the antenna's range also improved the reliability of drone connectivity. Making decisions using ML can reduce the handover rate and help achieve the required latency and reliability for URLLC applications in the 5G drone system. In [28], the authors detailed drone handover measurements and cell selection in a suburban environment. Experimental analysis showed that the handover rate increases with drone altitude; however, these results only focused on drone altitude. Drones may face several cell selection points in a fast-moving trajectory, where an intelligent algorithm is needed for cost-efficient decision making. In [29], the authors proposed a fuzzy inference method in an IoT environment where the handover decision depends on the drone's characteristics, i.e., the RSS, altitude, and speed of the drone. Perceptive fuzzy inference rules consider the rational cell associations that rely on the handover decision parameters.
Simultaneously, an algorithm that can learn an environment can make a better decision about whether a handover is needed. In [30], the authors proposed a handover algorithm for the drone in 3D space. Their technique is cost-efficient because it avoids recurrent handovers.
Furthermore, Ref. [30] evaluated the optimal coverage decision algorithm based on the probability of seamless handover success rate and the false handover initiation. Their algorithm focused on maximum reward gain by minimizing handover costs. Still, the trajectory was not the optimal route for drone flight. To address this drawback, ML tools can be applied to learn the environment and provide the optimal route. In [31], the authors evaluated the handover rate and sojourn time for a network of drone-based stations. Another factor to consider is that the drone's speed variations at different altitudes introduce Doppler shifts, cause intercarrier interference, and increase the handover rate. In [32], the authors proposed a scheme to avoid unnecessary handovers using handover trigger parameters that are dynamically adjusted. The proposed system enhanced the reliability of drone connectivity as well as minimized the handover cost. Regardless, the complete trajectory was not cost-efficient because the proposed technique did not employ any learning model such as reinforcement learning, and therefore the drone cannot decide which path is most cost-efficient in terms of handover.

Problem Statement
In an ultra-dense small-cell scenario, the coverage area of each cell is small, and drones may experience frequent handovers due to their fast movement. Furthermore, channel fading and shadowing cause ping-pong handovers. According to 3GPP, user equipment and drones aim to maximize the serving RSRP, as illustrated in Fig. 1. The Δ and β values are used to suppress unnecessary handovers. An A3 event is triggered when the RSRP of the neighboring cell becomes higher than the RSRP of the serving cell, resulting in a handover. This is a continuous process, and whenever a drone finds a stronger signal, handover occurs. These unnecessary handovers cause delay and packet loss, and the link remains unreliable, particularly in mission-critical drone use cases.
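As an illustration, the A3-style trigger described above can be sketched in Python. The sketch assumes that Δ acts as an RSRP margin in dB and that β is the number of consecutive measurement samples for which the condition must hold. These interpretations, the function name `a3_triggered`, and the default values are assumptions made for illustration; they are not taken from the 3GPP specification or from Fig. 1.

```python
from typing import Sequence


def a3_triggered(serving_rsrp: Sequence[float],
                 neighbor_rsrp: Sequence[float],
                 delta_db: float = 3.0,
                 beta_samples: int = 3) -> bool:
    """Return True once the neighbor RSRP exceeds the serving RSRP by more
    than delta_db for beta_samples consecutive samples (assumed semantics)."""
    run = 0  # length of the current streak of samples meeting the condition
    for serving, neighbor in zip(serving_rsrp, neighbor_rsrp):
        if neighbor > serving + delta_db:
            run += 1
            if run >= beta_samples:
                return True
        else:
            run = 0  # streak broken: start counting again
    return False
```

With Δ = 0 dB and β = 1 sample, the check degenerates to the greedy behavior in which any stronger neighbor immediately triggers a handover; larger values trade responsiveness for fewer ping-pong events.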

Motivation
In 5G dense small-cell deployment, drones face frequent handovers due to the short range of cells. This results in a high signaling cost, cell-drone link reliability issues, and poor user experience, particularly in time-sensitive drone use cases. In real-time scenarios, 5G can meet the requirements of several use cases; however, optimized handover remains an open issue. The baseline handover mechanism in 5G requires critical improvements to ensure seamless connectivity while maintaining a lower handover cost. Hence, a reinforcement learning-based solution can improve on the existing mechanism by accepting a lower signal strength at some points along the route. By taking the signaling overhead into account, the optimized tradeoff between the RSRP of serving cells and handover occurrence will efficiently lower the handover cost.

Proposed Solution
This study optimizes the handover procedure for a cellular-connected UAV drone that ensures robust wireless connectivity. The drone handover decisions for providing efficient mobility support are optimized with Q-learning algorithms, which are reinforcement learning techniques. The proposed framework considers handover rules from the received signal strength indicator (RSSI) and UAV trajectory information to improve mobility management. Handover signaling overhead is minimized using the Q-learning algorithm, within which the UAV needs to decide whether handover is required, and which handover is the most efficient path. The proposed algorithm depends on RSRP, which aids in the efficient handover decision and minimizes the handover cost. Moreover, the tradeoff between RSRP of serving cells and handover occurrence clearly shows that our Q-learning technique helps optimize this tradeoff and achieve minimum cost for the UAV route, all while considering the handover signaling overhead.

System Model
As shown in Fig. 2, the UAV is served by a cellular network within which several BS actively participate. We assume that UAVs fly at a fixed altitude with a two-dimensional (2D) trajectory path, and all information regarding UAV is known to the connected network. The UAV needs to connect to different BSs and perform more than one handover along its route to accomplish the trajectory path with reliable connectivity. Handover continuously changes the association between the UAV and BS until the UAV reaches its destination. Our study assumes predefined positions along the trajectory wherever the UAV needs to change its association to the next BS or for better connectivity. At every location, the UAV needs to decide whether to do a handover. The steps and signal measurement report involved in handover commands/procedure and admission control are shown in Fig. 3. The BS distribution, UAV speed and trajectory path, RSSI, RSRP (dBm) = RSSI − 10 × log(12 × N), and reference signal received quality RSRQ = (N × RSRP/RSSI) govern the result of a complete handover process.
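The measurement relations quoted above can be made concrete with a short Python sketch. It assumes that the logarithm is base 10, that N is the number of resource blocks in the measurement bandwidth, and that RSRQ is expressed in dB by converting the linear ratio N × RSRP/RSSI. The function names are hypothetical.

```python
import math


def rsrp_dbm(rssi_dbm: float, n_rb: int) -> float:
    """RSRP (dBm) = RSSI (dBm) - 10 * log10(12 * N), with N resource blocks."""
    return rssi_dbm - 10.0 * math.log10(12 * n_rb)


def rsrq_db(rssi_dbm: float, n_rb: int) -> float:
    """RSRQ = N * RSRP / RSSI in linear units, expressed here in dB."""
    return 10.0 * math.log10(n_rb) + rsrp_dbm(rssi_dbm, n_rb) - rssi_dbm


if __name__ == "__main__":
    # Example with an assumed RSSI of -60 dBm over 50 resource blocks.
    print(round(rsrp_dbm(-60.0, 50), 2), round(rsrq_db(-60.0, 50), 2))
```

Under this idealized relation the RSRQ always evaluates to 10 × log10(1/12), about −10.8 dB, independent of RSSI and N. Measured RSRQ differs in practice because RSSI and RSRP are measured independently and RSSI includes interference and noise.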
UAVs always seek the most reliable and robust BS, i.e., the one with the maximum RSRP value; however, this behavior may be disadvantageous for signaling overhead and reliable connectivity. For instance, every time it receives an RSRP value from a neighboring cell BS that is higher than the RSRP value of the serving cell BS, the UAV will trigger a handover along its trajectory path, which is costly. This baseline approach introduces ping-pong handovers and connectivity failures when the RSRP shifts abruptly. This expensive behavior leads us to construct an efficient UAV handover mechanism in a cellular network that ensures robust wireless connectivity with minimum cost. In this study, we propose a framework for handover decisions based on the Q-learning technique, which ensures robust connectivity while considering the handover signaling cost. Our proposed Q-learning-based framework treats the measurement reports (RSSI, RSRP, and RSRQ) and the handover cost as key inputs for handover decisions. The proposed framework also considers the tradeoff between RSRP values (to be maximized) and the number of handovers (to be minimized). Moreover, in the handover decisions, we consider ω_HO and ω_RSRP as weights to adjust the tradeoff between the RSRP of the serving cell and the number of handovers. We consider RSRP a proxy for robust connectivity and the number of handovers a proxy for signaling overhead along the whole trajectory. Consequently, our proposed Q-learning-based framework will maintain a good RSRP value with a minimum number of handovers throughout the trip.
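To make the cost of the baseline behavior concrete, the following minimal Python sketch connects the UAV to the strongest cell at every waypoint and counts the resulting handovers. The per-waypoint RSRP table and the function names are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence


def greedy_association(rsrp: Sequence[Sequence[float]]) -> List[int]:
    """For each waypoint (row), return the index of the cell with maximum RSRP."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in rsrp]


def count_handovers(serving_cells: Sequence[int]) -> int:
    """Count changes of the serving cell between consecutive waypoints."""
    return sum(1 for a, b in zip(serving_cells, serving_cells[1:]) if a != b)


if __name__ == "__main__":
    # Hypothetical RSRP values (dBm) of two cells at four waypoints.
    rsrp = [[-80.0, -85.0], [-86.0, -84.0], [-83.0, -85.0], [-88.0, -82.0]]
    cells = greedy_association(rsrp)
    print(cells, count_handovers(cells))
```

In this toy table the two cells differ by only a few dB, yet the greedy rule switches cells at every waypoint, which is the ping-pong pattern the proposed framework aims to avoid.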

Background of Q-Learning
Reinforcement learning (RL) is part of the broad area of ML. It is concerned with what to do and how to map situations to actions, where an agent takes an appropriate action to maximize the reward in a particular state. As shown in Fig. 4, the RL agent first observes state S_t and then takes an action A_t at time t. In response to the action, the agent receives feedback about that action (i.e., reward R_t), and to increase the reward accumulated over time, the agent must choose appropriate actions. This continues until the algorithm obtains the maximum reward value. An RL problem is formally described by a Markov decision process (MDP). An MDP can be represented as a tuple (S, A, {P_sa}, λ, R), where S denotes the set of states, A is the set of actions available to the agent, and P_sa gives the state transition probabilities for each state s ∈ S and action a ∈ A. The discount factor is denoted by λ ∈ [0, 1] (written as γ in the simulation results), and R is the reward function R : S × A → ℝ. The goal in an MDP is to obtain the optimum policy, which specifies the action to take at each state so as to maximize the reward. Q-learning [34] is an off-policy, model-free, and value-based reinforcement learning algorithm. The objective is to maximize the reward and learn the optimum policy for the given Markov decision process. Let Q^π(s, a) denote the expected cumulative reward when the Q-learning agent takes an action a in state s and thereafter selects actions according to policy π. After some learning episodes, the agent will ultimately learn the optimal Q-values Q*(s, a), and the highest Q-value for each state establishes an optimal policy. In this study, we denote the Q-value as Q_t(s, a) at time t throughout the process; after receiving the updated reward R_t for the current state s, the learning agent takes action A_o to transition to the next state s' with reward R_{t+1}.
The updated Q-value is computed as

$$Q_{t+1}(s, a) = (1 - \alpha)\, Q_t(s, a) + \alpha \left[ R_{t+1} + \lambda \max_{a'} Q_t(s', a') \right] \tag{1}$$

In Eq. (1), Q_{t+1}(s, a) is the updated Q-value, s' is the next state, λ is the discount factor, and α is the learning rate. After 250 training episodes, the Q-learning algorithm learns the optimal Q-values for all states using successive approximation.
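A single update step can be sketched in a few lines of Python. The sketch assumes the standard tabular Q-learning rule with learning rate α and discount factor λ, which is what Eq. (1) is taken to denote here. The table layout (a list of per-state action-value lists) and the function name are illustrative choices.

```python
from typing import List


def q_update(q: List[List[float]], s: int, a: int, reward: float,
             s_next: int, alpha: float = 0.1, lam: float = 0.9) -> float:
    """Apply one tabular Q-learning update in place and return the new value.

    q[s][a] <- (1 - alpha) * q[s][a] + alpha * (reward + lam * max_a' q[s_next][a'])
    """
    best_next = max(q[s_next])  # value of the best action in the next state
    q[s][a] = (1.0 - alpha) * q[s][a] + alpha * (reward + lam * best_next)
    return q[s][a]


if __name__ == "__main__":
    table = [[0.0, 0.0], [1.0, 2.0]]  # two states, two actions
    print(q_update(table, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, lam=0.9))
```

A larger α makes each new observation overwrite more of the old estimate, while λ controls how strongly the value of the next state influences the current one.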

Q-Learning-Based Drone Mobility Framework
This section will briefly describe the state, action, and reward to decide whether the handover is needed along a trajectory path. Moreover, we propose a Q-learning-based algorithm for making the optimal handover decision for the given trajectory path. The main parameters used for the proposed Q-learning-based algorithm for handover optimization are listed in Tab. 1.

Definitions
State: In Fig. 5a, we consider three parameters: the drone's position, represented by P_{s_o} : (x_{s_o}, y_{s_o}); the drone's movement direction θ_{s_o}, where θ ∈ {kπ/4, k = 0, ..., 7}; and the currently connected cell, represented by C_{s_o} ∈ C (the set of all neighbor cells). In our proposed algorithm, we specify the trajectory's initial (T_o) and final (T_m) positions. The drone's selected path is the shortest trajectory from the initial to the final position, and the drone always connects to the next predefined BS along the optimized trajectory. In our proposed model, the complete trajectory is not necessarily a straight line because of the fixed number of possible movement directions. Reinforcement learning algorithms are commonly used in drone trajectory path planning. Compared to conventional techniques, the proposed methodology considers the optimized trajectory computed by Q-learning rather than adopting a fixed predefined trajectory.

Algorithm 1 (fragment):
C V s ← C V s ;
9: end for
10: Reward matrix R ← Q;
11: while training step < n do
12: j = 0; ε-greedy algorithm:
13: for i in length(R) − 1 do
14: if ε > uniform random value on interval,

Reward: In our optimized drone handover mechanism, we aim to decrease the number of handovers while maintaining reliable connectivity, as shown in Fig. 5c. The drone may therefore stay connected to a cell with lower RSRP along the trajectory path so that frequent handovers are avoided. Our proposed model considers the handover cost weight ω_HO and the serving cell RSRP weight ω_RSRP at a future state in the reward function given by

$$R = -\omega_{HO}\, C(HO) + \omega_{RSRP}\, \mathrm{RSRP}_{s'} \tag{2}$$

These weights in Eq. (2), together with the indicator function C(HO), balance our two contradictory goals. The handover cost C(HO) will be "1" if the serving cells at states s and s' are different; otherwise, the cost will be "0".
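The reward can be illustrated with a short Python sketch. It assumes that the handover indicator is penalized with weight `w_ho` and the RSRP of the cell serving the next state is rewarded with weight `w_rsrp`; the sign convention, the parameter names, and the use of raw dBm values in the example are assumptions made for illustration.

```python
def handover_cost(serving_cell: int, next_cell: int) -> int:
    """Indicator C(HO): 1 if the serving cell changes between states, else 0."""
    return int(serving_cell != next_cell)


def reward(serving_cell: int, next_cell: int, next_rsrp: float,
           w_ho: float = 1.0, w_rsrp: float = 1.0) -> float:
    """Weighted tradeoff: penalize a handover, reward the RSRP at the next state."""
    return -w_ho * handover_cost(serving_cell, next_cell) + w_rsrp * next_rsrp


if __name__ == "__main__":
    # Staying on cell 3 versus switching to cell 5 with the same next-state RSRP.
    print(reward(3, 3, -80.0, w_ho=10.0), reward(3, 5, -80.0, w_ho=10.0))
```

Raising `w_ho` relative to `w_rsrp` makes a handover worthwhile only when the RSRP gain exceeds the penalty, which is how the two weights steer the tradeoff between signal strength and signaling overhead.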

Q-learning-based Algorithm for Handover Optimization
In our proposed model, at every single state, the action space A is constrained to the strongest V candidate cells, denoted by the set I_V = {0, 1, ..., V − 1}. The Q-table Q ∈ ℝ^{l×V×V} is updated according to Eq. (1) for a drone trajectory path. The complexity of Algorithm 1 is O(cn), where "n" denotes the number of training episodes, and "c" is a constant equivalent to the total route length "l". Steps 2-9 produce the preliminary Q-table for the given drone's path, and a binary square matrix (size V) is generated in step 6.
Furthermore, the (p, q)-th entry of the matrix is "1" if the p-th strongest cell at state s is different from the q-th strongest cell at state s'; otherwise, it is "0". Steps 11-24 execute each training episode and update the Q-values; for example, the ε-greedy exploration runs in steps 14-18, and the Q-value is updated in the table at step 20. Finally, the Q-table contains values for the different actions, and the highest value indicates the optimal choice. Consequently, for an efficient mechanism, the handover decisions can be obtained from the maximum Q-value at each state along the given trajectory. The block diagram for the proposed model is shown in Fig. 6.
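The procedure described above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the route is given as `cells[i][p]`, the p-th strongest cell at waypoint i, with matching `rsrp[i][p]` values normalized to [0, 1]; exploration uses the standard ε-greedy rule (a random action with probability `eps`); the Q-table has shape (l − 1) × V × V because no action is taken at the final waypoint; and all function and parameter names are hypothetical.

```python
import random


def build_handover_indicators(cells):
    """h[i][p][q] = 1 if the p-th strongest cell at waypoint i differs from
    the q-th strongest cell at waypoint i + 1, else 0."""
    return [[[int(cp != cq) for cq in cells[i + 1]] for cp in cells[i]]
            for i in range(len(cells) - 1)]


def train(cells, rsrp, w_ho=1.0, w_rsrp=1.0, alpha=0.5, lam=0.9,
          eps=0.5, episodes=250, seed=0):
    """Learn Q[i][p][a]: value of choosing candidate a at waypoint i + 1
    while served by candidate p at waypoint i."""
    rng = random.Random(seed)
    n_steps, v = len(cells) - 1, len(cells[0])
    h = build_handover_indicators(cells)
    q = [[[0.0] * v for _ in range(v)] for _ in range(n_steps)]
    for _ in range(episodes):
        p = 0  # each episode starts on the strongest cell at the first waypoint
        for i in range(n_steps):
            if rng.random() < eps:  # explore with probability eps
                a = rng.randrange(v)
            else:  # otherwise exploit the current estimate
                a = max(range(v), key=lambda c: q[i][p][c])
            r = -w_ho * h[i][p][a] + w_rsrp * rsrp[i + 1][a]
            future = max(q[i + 1][a]) if i + 1 < n_steps else 0.0
            q[i][p][a] = (1 - alpha) * q[i][p][a] + alpha * (r + lam * future)
            p = a  # the chosen candidate becomes the serving cell
    return q


def greedy_policy(cells, q):
    """Follow the maximum Q-value at each waypoint; return the serving cells."""
    p, path = 0, [cells[0][0]]
    for i in range(len(q)):
        p = max(range(len(q[i][p])), key=lambda c: q[i][p][c])
        path.append(cells[i + 1][p])
    return path


if __name__ == "__main__":
    # Two cells whose ranking alternates at every waypoint (toy example).
    cells = [["A", "B"], ["B", "A"], ["A", "B"], ["B", "A"]]
    rsrp = [[0.9, 0.8]] * 4
    print(greedy_policy(cells, train(cells, rsrp, w_ho=1.0)))
    print(greedy_policy(cells, train(cells, rsrp, w_ho=0.0)))
```

In the toy route the strongest cell alternates between "A" and "B". With a handover weight of 1.0 the learned policy stays on one cell and accepts the slightly lower RSRP, whereas a weight of 0.0 reproduces the strongest-cell behavior of the baseline.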

Simulation and Results
This section evaluates the proposed Q-learning-based handover scheme against the 3GPP access-beam-based method (greedy handover algorithm), where drones are always connected to the strongest cell. We calculate the handover ratio as a performance metric for every drone trajectory. In the baseline scenario, the drone is always connected to the strongest cell, and the handover ratio is constant at 1. The performance evaluation for diverse weight combinations of ω_HO and ω_RSRP in the reward function represents the tradeoff between upcoming RSRP values and the number of handovers. The handover ratio approaches zero and the number of handovers decreases when the ratio ω_HO/ω_RSRP increases. Meanwhile, the proposed Q-learning-based handover and the baseline algorithm will show similar results in the special scenario where the handover cost C(HO) is zero, that is, when no handover occurs.
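The metric can be written down explicitly. The sketch below assumes that the handover ratio is the number of handovers under the evaluated scheme divided by the number under the greedy baseline on the same trajectory, which is consistent with the baseline value of 1 stated above; the function name is hypothetical.

```python
def handover_ratio(n_proposed: int, n_baseline: int) -> float:
    """Handovers of the evaluated scheme relative to the greedy baseline."""
    if n_baseline <= 0:
        # The ratio is undefined when the baseline performs no handover.
        raise ValueError("baseline must perform at least one handover")
    return n_proposed / n_baseline


if __name__ == "__main__":
    print(handover_ratio(1, 4))  # one handover instead of four
```

A ratio of 1 means the scheme performs as many handovers as the baseline, and a ratio close to 0 means that almost all baseline handovers were avoided.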
Extensive simulations are conducted to evaluate the performance based on the number of episodes and the accumulated reward gained in each episode. The proposed algorithm is evaluated by varying the Q-learning parameters (α, ε, γ) to find the optimal results. The results show that the proposed algorithm converges on the maximum reward for each randomly generated route. As shown in Tab. 2, parameters are set based on exploration and exploitation with an ε-greedy algorithm: α = 0.1, 0.5, 0.9; ε = 0.1, 0.5, 0.9; and γ = 0.1, 0.5, 0.9. The variations in α, ε, and γ show the optimized results for the drone's trajectory. In Fig. 7, we show the accumulated reward when the learning rate (α) varies in the range of 0.1−0.9; meanwhile, the values for epsilon (ε) and the discount factor (γ) are 0.5 and 0.9, respectively. After 7 initial episodes, the best accumulated reward of the proposed algorithm is at learning rate α = 0.9. For α = 0.5, the accumulated reward stays higher than for α = 0.1 from episode 7 to 70 but degrades afterward up to episode 250. The best parameters are α = 0.9, ε = 0.5, and γ = 0.9. Since the proposed Q-learning-based algorithm reduces the ping-pong handovers, the RSRP is also reduced.
In Fig. 8, we show the accumulated reward as the discount factor (γ) varies in the range of 0.1−0.9; meanwhile, the values for the learning rate (α) and epsilon (ε) are 0.3 and 0.5, respectively. After 70 initial episodes, the best accumulated reward of the proposed algorithm is at γ = 0.5. For γ = 0.9, the accumulated reward stays higher than for γ = 0.1 and 0.5 from episode 20 to 70 but degrades afterward up to episode 250. Meanwhile, γ = 0.1 yields the worst accumulated reward. As shown in the results, the best parameters are α = 0.3, ε = 0.5, and γ = 0.5. There will be no handover when the proposed scheme is equivalent to the baseline scheme.
In Fig. 9, we show the accumulated reward when epsilon (ε) varies in the range of 0.1−0.9; meanwhile, the values for α and γ are 0.3 and 0.9, respectively. After 18 initial episodes, the best accumulated reward of the proposed algorithm is at ε = 0.9. For ε = 0.1, the accumulated reward initially stays higher, from episode 1 to 31, but degrades afterward up to episode 250.
As shown in the results, the best parameters are α = 0.3, ε = 0.9, and γ = 0.9. To avoid unnecessary handovers, we need to increase the ratio ω_HO/ω_RSRP, and then the cost of handover also decreases. In Fig. 10, we show the accumulated rewards of the proposed algorithm in each scenario (α = 0.9, γ = 0.5, and ε = 0.9). The simulation results show that the choice of parameters affects drone performance throughout the learning phase and also shapes the learning curve. The results indicate that the learning process is an effective way to optimize drone mobility in dense scenarios.

Conclusions and Future Work
In this work, we proposed a machine learning-based algorithm to accomplish strong drone connectivity with less handover cost such that the drone will not always connect to the strongest cell in a trajectory. We suggested a robust and flexible way to make handover decisions using a Q-learning framework. The proposed scheme reduces the total number of handovers, and we can observe a tradeoff between received signal strength and the number of handovers relative to the baseline that always connects the drone to the strongest cell. This tradeoff can be adjusted by changing the weights in the reward function. There are many potential directions for future work, such as exploring which additional parameters may further enhance reliability during handover decision making. This work presents a notable contribution to determining the optimal route of drones for researchers who are exploring UAV use cases in cellular networks where a large testing site comprising several cells with multiple UAVs is under consideration. Finally, the proposed framework studies 2D drone mobility; a 3D mobility model would introduce more parameters to aid efficient handover decisions.