Data-Driven Self-Learning Controller for Power-Aware Mobile Monitoring IoT Devices

Abstract: Nowadays, there is a significant need for maintenance-free modern Internet of Things (IoT) devices which can monitor an environment. IoT devices such as these are mobile embedded devices which provide data to the internet via a Low Power Wide Area Network (LPWAN). LPWAN is a promising communications technology which allows machine-to-machine (M2M) communication and is suitable for small mobile embedded devices. The paper presents a novel data-driven self-learning (DDSL) controller algorithm which is dedicated to controlling small mobile maintenance-free embedded IoT devices. The DDSL algorithm is based on a modified Q-learning algorithm which allows energy-efficient data-driven behavior of mobile embedded IoT devices. The aim of the DDSL algorithm is to dynamically set operation duty cycles according to the estimation of future collected data values, leading to effective operation of power-aware systems. The presented novel solution was tested on a historical data set and compared with a fixed duty cycle reference algorithm. The root mean square error (RMSE) and measurement-count parameters of the DDSL algorithm were compared to the reference algorithm, and two independent criteria (the performance score parameter and the normalized geometric distance) were used for overall evaluation and comparison. The experiments showed that the novel DDSL method reaches a significantly lower RMSE while the number of transmitted data points is less than or equal to that of the fixed duty cycle algorithm. The overall performance score criterion is 40% higher than that of the reference algorithm based on static configuration settings.


Introduction
The article deals with the design and application of a control algorithm for a prototype of an efficient Low-Cost, Low-Power, Low-Complexity (hereinafter L-CPC) bidirectional communication system for the reading and configuration of embedded devices. Low Power Wide Area Networks (LPWANs) and the fifth-generation technology standard for broadband cellular networks (5G) are promising technologies for the connection of compact monitoring mobile devices.

Several research articles have used various implementations of RL principles, especially QL, in monitoring IoT devices at the network level (see Tab. 1). QL algorithms can be used to iteratively change MAC protocol parameters by a defined policy to achieve a low-energy state [32]. The TDMA-based adaptive task scheduling method [33] or two-tier data dissemination schemes based on Q-learning (TTDD-QL) [34] are energy efficient for wireless sensor networks (WSN). A cooperative energy-efficient model is presented in [35], where clustering, mobile sink deployment and variable sensing collaboratively improve the network lifetime. Besides routing-based or cooperative optimization, there are other research challenges which implement the QL procedure in mobile IoT devices or WSNs. Future incoming solar energy can be predicted with Q-learning solar energy prediction (QL-SEP) [36], which is useful for solar-powered devices. In [37], an optimal energy management strategy for a portable embedded system based on QL was proposed to extend system lifetime. The QL algorithm also proved to be a suitable solution in terms of energy for wireless embedded systems such as sensor nodes and smartphones [38]. A dynamic energy-efficient system based on the QL technique to control energy management is used in real-time embedded systems [39]. Based on the presented state-of-the-art review (see Tab. 1), the authors state that several research works describe the use of QL to achieve power-effective solutions in embedded systems, although research works exploring data-driven power-aware approaches using QL have not yet been published.
In this article, the application of a novel DDSL control approach for mobile monitoring IoT devices based on wake-up scheduling (Fig. 1) is presented. The core of the algorithm is to dynamically set an operation period through a wake-up timer configuration according to the correct estimation of future collected data values, which potentially leads to effective operation of power-aware systems. For evaluation purposes, historical incoming solar irradiance data from an environmental monitoring device were used. The presented self-learning algorithm was also evaluated with a set of various QL expert parameter configurations. Predicted values were compared to the values collected from sensors to provide input parameters for the learning process. The testing procedure compares a complete set of collected data and a reduced set with linear interpolation. The article's novelty lies in the modification of the QL approach to allow energy-efficient data-driven behavior of embedded IoT devices.

Figure 1: General application principle of a DDSL controller: The mobile device collects and stores parameters of interest into memory. The DDSL controller sets a data collection duty cycle and updates the algorithm through data-driven learning.

The remainder of the article is organized as follows: the background section describes power-aware challenges, the general Q-learning algorithm principle and future value estimation by polynomial approximation. The experimental section describes the designed controller, the reference algorithm and the evaluation criteria. The experiment summary is elaborated in the results section, followed by a technical discussion. The final section concludes the article and discusses several research challenges as future work.

Materials and Methods
This section introduces the theoretical background for a general description of the Q-learning algorithm and mathematical formalization of the applied polynomial approximation.

Q-Learning Algorithm
QL belongs to a family of reinforcement learning methods which explore an optimal strategy for a given problem. This semi-supervised, model-free algorithm was introduced by Watkins [40] and is formulated as a finite Markov decision process, which is a mathematical formalization of the underlying decision-making process.
QL defines an agent which is responsible for the selection of action A_t from a set of actions. The agent learns through interaction with its environment (Fig. 2). The QL strategy trains the agent to take the action which maximizes its long-term reward. The agent regularly updates its achieved rewards according to the selected action in a specific state. The QL approach also uses a memory-stored array called the Q-table, whose size is defined by the number of states S and actions A. The array's columns represent the quantitative values of possible actions. The QL algorithm is controlled by the following equation:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right],$$

where α is a learning rate which controls the convergence speed of the learning process. When α = 0, the algorithm uses only previous estimates of the reward signal; when α = 1, the algorithm applies only new knowledge. Q(S_t, A_t) represents the estimated value of the reward in the Q-table for the current action A_t and state S_t. The variable R represents a received reward signal. A discount rate (γ) determines whether the agent attempts to maximize the immediate reward (γ = 0) or to maximize the future cumulative reward (γ = 1).
The learning strategy is also influenced by a constant ε (the epsilon-greedy policy), which causes the selection of a random action instead of the maximal-reward action. ε is selected from the interval 0 to 1 (e.g., ε = 0.95 means that 5% of actions are random) [31].
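As an illustration of the update rule and the epsilon-greedy policy described above, the following MATLAB sketch performs a single learning step. The Q-table dimensions, expert constants and the placeholder feedback values (R, sNext) are illustrative assumptions, not the configuration used in the paper.

    % One Q-learning step with epsilon-greedy action selection (sketch).
    nStates  = 5;                 % size of the state set S
    nActions = 6;                 % number of selectable wake-up periods
    Q = zeros(nStates, nActions); % Q-table stored in memory

    alpha   = 0.1;                % learning rate
    gamma   = 0.9;                % discount rate
    epsilon = 0.95;               % 5% of actions are selected at random

    s = 3;                        % current state (example value)
    if rand() <= epsilon
        [~, a] = max(Q(s, :));    % exploit: action with the maximal reward estimate
    else
        a = randi(nActions);      % explore: random action
    end

    % ... action a is performed, reward R and next state sNext are observed ...
    R = 1; sNext = 4;             % placeholder environment feedback

    % Q-learning update corresponding to the equation above
    Q(s, a) = Q(s, a) + alpha*(R + gamma*max(Q(sNext, :)) - Q(s, a));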

Polynomial Approximation
The polynomial approximation interpolates values with a polynomial. A polynomial is a function written in the form $p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$, where $a_0, a_1, \ldots, a_n$ are constants (the coefficients of the polynomial) and x is the variable. If $a_n \neq 0$, then n is the degree of the polynomial p; the degree n is defined by the greatest value of the exponent.
An approximation is an inexact expression of some function. In this paper, the polynomial coefficients are calculated using the least-squares approximation method by summing the squared values of the deviations; this sum should be minimal (see Eq. (3)), where e_i is the deviation of the original value x_i from the obtained polynomial value p(x_i) (see Eq. (4)).
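For clarity, the least-squares criterion described above can be written as follows; this is a minimal sketch using the deviations e_i defined in the text, and the exact notation of the paper's Eqs. (3) and (4) may differ:

$$e_i = x_i - p(x_i), \qquad E = \sum_{i=1}^{m} e_i^{2} \rightarrow \min,$$

where m is the number of fitted samples and the coefficients of p are chosen so that E is minimal.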

Dataset
The experiment uses a dataset from an environmental data collection station. The data include values of incoming solar energy as simulated input from a sensor. The solar energy values were collected continuously for five years at the Fairview Agricultural Drought Monitoring (AGDM) station located in Alberta, Canada [41], at coordinates 56.0815° latitude, −118.4395° longitude, and 655.00 m elevation. This dataset contains the total incoming solar radiance in W/m², collected at five-minute intervals.

Experiment
The aim of the performed experiment is to evaluate whether the DDSL controller is capable of finding an optimal strategy for dynamic configuration of the data collection period. A conventional QL algorithm was modified for application in wake-up embedded devices. The experiment was performed in MATLAB, and the complete solution is straightforward to implement on mobile monitoring devices.

Controller Design
The proposed DDSL controller dynamically sets the operation period according to the correct estimation of the collected data in order to adjust the operation duty cycle. The DDSL controller follows the RL model shown in Fig. 2. The core of the DDSL controller algorithm is the selection of action A, the subsequent change of the environment to state S, and the reward which depends on the selected action and the caused state. The self-learning process of the DDSL controller is based on the QL approach.
Action A, which represents a period (time slot) T_next, sets the next wake-up period of the monitoring device. The selection of the action, which is based on the DDSL controller policy, causes a change in the environment (Fig. 3). The environment passes the value of T_next and the value of x at time T_next to the prediction engine block. This block estimates the predicted future value x_next by a polynomial approximation with a variable degree of the polynomial N. The prediction engine block also calculates the estimation accuracy, which is the difference between the predicted value x_p and the value of x collected from the sensor. In the Lookup Table (LUT) block, this estimation accuracy is used to determine the appropriate state S.

Figure 3: Block diagram of the DDSL controller: The Q-learning block selects action A, which causes the environment feedback (state S and reward R) to control the self-learning process.

Based on the current state and the performed action, the partial rewards (the state reward R_S and the action reward R_A) are estimated. The R_S value is positive if the controller changes state from low to high accuracy, negative if the controller changes state from high to low accuracy, and zero if there is no change; in general, the DDSL approach prefers high-accuracy states. The index_of() function used in this reward returns the one-based order of an element in the state vector (a higher index represents higher estimation accuracy). The R_A reward has a value assigned according to the performed action: a slow operation period corresponds to low energy demands, and the index_of() function applied to the action vector returns a higher value for a longer duty cycle. The total reward R is formulated as the sum of R_S and R_A:

$$R = R_S + R_A$$

The QL process is affected by the total reward R and the current state S with a variable configuration of the expert constants (α, ε, γ). The action A selected by the QL policy is the output of the QL block.
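To make the reward composition concrete, the following MATLAB sketch shows one plausible reading of the R_S, R_A and total reward definitions above. The index-difference form of R_S and the example state and action vectors are assumptions, not the paper's exact equations.

    % Reward composition of the DDSL controller (hedged sketch).
    stateVector  = [1 2 3 4 5];    % states ordered from low to high estimation accuracy
    actionVector = [1 2 3 4 5 6];  % actions ordered from short to long wake-up period

    sPrev = 2; sCurr = 4;          % example previous and current states
    a     = 5;                     % example selected action (a long sleep period)

    % index_of(): one-based order of an element in a vector
    index_of = @(element, vector) find(vector == element, 1);

    % State reward: positive for a low-to-high accuracy transition,
    % negative for high-to-low, zero when the state does not change.
    R_S = index_of(sCurr, stateVector) - index_of(sPrev, stateVector);

    % Action reward: longer duty cycles (lower energy demand) score higher.
    R_A = index_of(a, actionVector);

    % Total reward is the sum of both components.
    R = R_S + R_A;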
The DDSL approach is equipped with a discounted learning rate to achieve stability in the learning process. The discounting progress of the parameter α is shown in Fig. 4. In each step of the algorithm, α is discounted by the learning discount (LD), specifically LD = 0.01%, which means that α decreases to 50% after the first 24 days (approx. one month) and to 10% after the first 80 days (approx. three months). The conventional QL algorithm presented in the literature [42] is not directly applicable to the proposed experiment; therefore, a modification of the original algorithm was designed. The difference between the conventional QL and the modified version is shown in the following algorithm descriptions.
The conventional Q-learning algorithm described in [42] repeatedly chooses action A from state S using a policy derived from Q (e.g., the epsilon-greedy policy), performs the action, observes the reward and the next state, and updates the Q-table. The modified Q-learning algorithm reorders these steps, as described below.

In the original QL algorithm, the performed action step is inside the QL algorithm loop, but from the monitoring device point of view, the performed action itself is a duration of standby or sleep mode. In the modified scenario, the algorithm performs an action at a different stage than the original approach. The learning process part is completed based on the past state and current state because the future action is unknown.
In the conventional QL algorithm, an action is first selected according to the QL policy and the environment state. The action is performed, and a reward based on the previous state and the actual action is calculated. In the next step, the Q-table is updated by the learning process and a new state S is observed. In the modified QL, the loop also starts by selecting and performing an action, but it introduces a new variable called sleep, which represents the action, i.e., the selected sleep time. The reward from the previous state and action is then calculated, and the Q-table is updated by the learning process itself.
The modified QL algorithm is controlled by the following equation:

$$Q(S_{t-1}, A_{t-1}) \leftarrow Q(S_{t-1}, A_{t-1}) + \alpha \left[ R + \gamma \max_{a} Q(S_t, a) - Q(S_{t-1}, A_{t-1}) \right],$$

where α is the learning rate, Q(S_{t-1}, A_{t-1}) represents the value of the reward in the Q-table for the previous action A_{t-1} and previous state S_{t-1}, and R represents the immediate reward.
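The modified loop can be sketched in MATLAB as follows: the action (the selected sleep period) is performed first, and the Q-table is then updated from the previous state/action pair using the equation above. The environment, the reward and all constants here are trivial stand-ins rather than the paper's implementation.

    % Modified Q-learning loop of the DDSL controller (hedged sketch).
    nStates = 5; nActions = 6;
    periods = [5 10 15 20 25 30];        % candidate wake-up periods in minutes
    Q = zeros(nStates, nActions);
    alpha = 0.1; gamma = 0.9; epsilon = 0.95; LD = 0.0001;

    sPrev = 1; aPrev = 1;
    s = randi(nStates);                  % initial state

    for step = 1:1000                    % on a device this loop runs indefinitely
        % select the next action (sleep time) by the epsilon-greedy policy
        if rand() <= epsilon
            [~, a] = max(Q(s, :));
        else
            a = randi(nActions);
        end
        sleep = periods(a);              % the 'sleep' variable: duration of standby mode
        % ... the device sleeps for 'sleep' minutes, wakes up and collects data ...

        % reward from the previous state/action pair (placeholder for R_S + R_A)
        R = (s - sPrev) + a;

        % learning step based on the past state/action and the current state
        Q(sPrev, aPrev) = Q(sPrev, aPrev) + ...
            alpha*(R + gamma*max(Q(s, :)) - Q(sPrev, aPrev));

        alpha = alpha*(1 - LD);          % learning-rate discount
        sPrev = s; aPrev = a;
        s = randi(nStates);              % placeholder for the newly observed state
    end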
The polynomial approximation method is used to evaluate the next value x_next of the collected data. In this experiment, the polynomial coefficients are calculated using MATLAB's polyfit function, which performs a least-squares approximation. The inputs to the polyfit function are a time vector, a solar irradiance vector, and the degree of the polynomial N. The output of the polyfit function is a vector of coefficients of the polynomial p(x) which fits the input data. The coefficients are in descending powers, and the length of the vector is N + 1.
In the next step, MATLAB's polyval function is used to evaluate the prediction while preventing negative values. The polyval function evaluates the polynomial p at the point x (see Eq. (2)), where p is the vector of coefficients and the point x is the index value of the action in the specific simulation step. If the evaluated polynomial value is greater than or equal to 0, it is assigned to x_next; otherwise, a zero value is assigned to x_next.
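A minimal MATLAB sketch of this prediction step is shown below; the window length, the example irradiance values and the evaluation point are illustrative assumptions.

    % Prediction of x_next with polyfit/polyval and clamping of negative values.
    N = 2;                               % degree of the polynomial
    t = (1:8)';                          % time vector of the recent data window
    x = [0 5 20 60 110 180 260 350]';    % collected solar irradiance values, W/m^2

    p = polyfit(t, x, N);                % least-squares coefficients, length N + 1
    tNext = 10;                          % index value of the action in this step
    xPred = polyval(p, tNext);           % evaluate the polynomial at tNext

    % negative irradiance predictions are not meaningful, so clamp them to zero
    if xPred >= 0
        xNext = xPred;
    else
        xNext = 0;
    end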

Reference Solution and Evaluation Criteria
To evaluate the DDSL controller approach, a reference algorithm with a linear interpolation method is used. The original collected data have a 5-min data collection interval. Therefore, the reference solution is based on the original data set from which only samples at 10-, 15-, 20-, 25- and 30-min intervals are extracted. To fill in the missing data between the extracted samples, the linear interpolation method is used.
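A minimal MATLAB sketch of this reference procedure is given below; the synthetic one-day irradiance series and the variable names are illustrative assumptions.

    % Reference solution: extract a coarser interval and fill the gaps
    % back onto the 5-min grid by linear interpolation.
    t = (0:5:1435)';                           % one day of 5-min time stamps, in minutes
    x = max(0, 800*sin(pi*(t - 360)/720));     % synthetic daytime irradiance, W/m^2

    interval = 15;                             % extracted interval in minutes (10 to 30)
    step     = interval / 5;                   % original data are sampled every 5 min

    tCoarse = t(1:step:end);                   % extracted time stamps
    xCoarse = x(1:step:end);                   % extracted samples

    xRef = interp1(tCoarse, xCoarse, t, 'linear');   % reference series on the 5-min grid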
To compare the prediction accuracy between the individual settings of the expert constants and the reference solution, the root mean squared error (RMSE) was calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i\right)^{2}},$$

where n is the size of the data set, y_i is the value from the original data set and x_i is the evaluated value from the reference or DDSL data set. The RMSE value is smaller for a more accurate algorithm. The DDSL controller policy can achieve minimization of the RMSE through the R_S reward component.
The Number of Measurements (NoM) is the second evaluation parameter, which counts the number of operation periods. The algorithm policy is principally designed to minimize NoM (through the R_A reward component), since this behavior leads to minimal power consumption.
The performance score (PS) is the overall evaluation parameter which considers both above-mentioned parameters (RMSE and NoM) and is calculated from maxRMSE_REF, the maximal RMSE value of the reference algorithm, and RMSE_{α,γ} and NoM_{α,γ}, the RMSE and NoM values of the experiment with a specific DDSL controller setting. A higher PS value means that the algorithm setting is more efficient. In general, the PS value of the reference dataset at the 30-min interval is 0, because RMSE_{α,γ} equals maxRMSE_REF.
Evaluation parameters NoM and RMSE reflect opposing aspects of the controller's behavior, and a trade-off between a reduced NoM and a satisfactory RMSE should therefore be considered. To evaluate this trade-off of the DDSL approach, a Cartesian distance to the origin is also used.
The RMSE and NoM parameters are normalized with respect to the worst case, i.e., the RMSE of the 30-min reference algorithm and the NoM of the 5-min reference algorithm:

$$\overline{\mathrm{RMSE}} = \frac{\mathrm{RMSE}}{\mathrm{RMSE}_{\mathrm{ref\,30\,min}}}, \qquad \overline{\mathrm{NoM}} = \frac{\mathrm{NoM}}{\mathrm{NoM}_{\mathrm{ref\,5\,min}}} \quad (12)$$

An overall Cartesian evaluation parameter L is then calculated as the distance of the normalized point to the origin:

$$L = \sqrt{\overline{\mathrm{RMSE}}^{\,2} + \overline{\mathrm{NoM}}^{\,2}}$$
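A short MATLAB sketch of this normalized evaluation is given below, assuming L is the Euclidean distance to the origin as reconstructed above; all numeric values are illustrative only.

    % Normalized Cartesian evaluation of one algorithm configuration (sketch).
    rmse      = 38.2;       % RMSE of a DDSL run (illustrative value)
    nom       = 9000;       % number of measurements of the run (illustrative value)
    rmseRef30 = 55.0;       % worst-case RMSE: 30-min reference algorithm
    nomRef5   = 105120;     % worst-case NoM: 5-min reference algorithm

    rmseNorm = rmse / rmseRef30;
    nomNorm  = nom  / nomRef5;
    L = sqrt(rmseNorm^2 + nomNorm^2);    % a lower L means a better trade-off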

Results
This section provides the results of a comprehensive set of experiments which verify the designed controller with various QL parameter settings and degrees of the polynomial. Each experiment configuration was repeated ten times to eliminate the effect of the epsilon-greedy policy. Experiments were performed with various settings of α_0, γ and the degree of the polynomial.

Fig. 5 shows an overall comparison of the reference algorithm and the DDSL controller, together with the highest PS results for various degrees of the polynomial for the DDSL controller. The DDSL controller provides approximately 40% higher PS than the best reference algorithm, with the exception of the degree of polynomial N = 5, which provides only 23% higher PS. Tab. 2 provides a numerical summary of the algorithm settings with the highest PS. The highest PS was achieved by algorithms with low α_0 settings and high γ settings, which indicates a slow learning process and a preference for cumulative reward. The exception is the algorithm with degree of polynomial 3, where the γ setting is lower than the others.

Fig. 6a shows the PS comparison between different settings of the degree of the polynomial. It can be seen that all degrees of the polynomial achieved higher performance than the best (5-min) reference algorithm. It is also notable that the highest performance was achieved by algorithms with degrees of polynomial 1 and 2. Additionally, these configurations of the DDSL algorithm do not demonstrate lower performance than the 25-min reference algorithm.

Fig. 6b shows the DDSL controller results in Cartesian coordinates (results below the reference line reach better scores than the reference algorithm). The x-axis represents the NoM and the y-axis the RMSE. This representation provides a reference algorithm borderline which divides the two-dimensional Cartesian coordinate system into two parts. Results above the reference algorithm borderline mean that the algorithm achieves a worse PS than the reference approach. Results beneath the reference algorithm borderline return at least a lower RMSE with the same NoM as the reference algorithm, or a lower NoM with the same RMSE, respectively. The geometric distance to zero is a crucial evaluation parameter; the weights of the x- and y-axes should be considered or normalized to balance the effect of the RMSE and NoM parameters (parameter L).

Fig. 7 shows 3D bar graphs for various α and γ for degrees of polynomial 1 and 2. An increase of PS can be observed for α_0 = 0.1, 0.3, and the areas close to both limit values of γ (0 and 1) are very satisfactory for the DDSL controller settings. The algorithms with degrees of polynomial 1 and 2 provide the highest PS, while the algorithms with degrees of polynomial 3, 4 and 5 provide significantly lower performance. It can also be observed that the PS falls slightly when γ decreases; this decrease is more significant for higher degrees of the polynomial (4 and 5). A more dynamic algorithm which prefers instant rewards and uses a higher degree of polynomial to estimate the future achieves a lower PS than farsighted algorithms.

Fig. 9 represents the PS in a Cartesian coordinate system for degrees of the polynomial N = 1 and N = 2. Fig. 9 also distinguishes the α settings and the five best PS results; different colored areas correspond to the various α_0 settings.
The high α_0 setting (red) area is located close to the minimal NoM and maximal RMSE values. This means that dynamic algorithms with high α_0 settings provide high compression at the cost of an increase in the RMSE parameter. The best algorithm settings are located in the area with the lowest RMSE and fall in the middle of the NoM range. These algorithms provide the best trade-off between the RMSE and NoM parameters. It can be noted that the highest PS is achieved by low α settings.

Discussion
The results raise several interesting areas for discussion. The first concerns the correct selection of the degree of the polynomial. The presented experiment used degrees of the polynomial from 1 to 5. Based on the input solar irradiance data, the DDSL approach provided the best-performing result for degree of the polynomial 2; degree of the polynomial 1 also provided better performance than 3, 4 and 5 in this case. It must be highlighted that selection of the appropriate degree of the polynomial is directly linked to the type of data collected from the sensors. In our case, the best-performing result was achieved by linear or quadratic approximation, represented by degrees of the polynomial 1 and 2. In the case of a different dataset, correct selection of this coefficient could lead to a higher degree of the polynomial. Regarding the key feature of the DDSL approach, exploratory studies of suitable degrees of the polynomial should be performed before mobile monitoring IoT devices are deployed in target application areas. The capability of the self-learning approach is limited without custom adjustment of the degree of the polynomial according to the character of the collected data.
The configuration of the Q-learning parameters is the second area to discuss. Deployment of mobile monitoring devices should consider proper selection of the learning rate, discount factor and epsilon-greedy policy. The article's results showed that the initial learning rate should be set conservatively, from 0.1 to 0.3. The DDSL controller then accepts new information slowly and keeps its already obtained knowledge stored in the Q-table. However, in terms of the discount factor, there is no conclusive result. With a degree of polynomial 1, the experiment showed that the best results are achieved with high cumulative discount factors (0.8-1). However, the results with a degree of polynomial 2 showed that an instant-reward policy with a low discount factor (<0.4) can also lead to the best-performing solutions. Therefore, the discount factor setting is not simply a function of the input dataset but has a strong connection to the degree of the polynomial. The epsilon-greedy policy is set to 5% of random actions, as is standard in such applications, but the question is whether this leads to the best performance in long-term deployments where the learning rate is significantly reduced by the learning discount coefficient. This idea should be evaluated with long-term field testing or extensive simulations on an extended dataset. In this case, however, the study does not provide a general answer for setting up the initial epsilon-greedy and discount policies.
The final discussion topic concerns the evaluation policy of the presented solution. Two basic approaches were designed: one which uses a linear ratio between the RMSE and NoM, and a second which is calculated as the geometric distance in a normalized Cartesian space. Both evaluation methodologies followed the same aim, which was to determine an evaluation coefficient which targets the trade-off between low RMSE and low NoM. Both methodologies provide similar results in an opposing manner, one maximizing the linear ratio and the other minimizing the normalized distance. In other implementation scenarios, the evaluation strategy may vary according to the specific optimization target.
Tab. 4 shows a general comparison of the DDSL controller approach with three QL state-of-the-art methods. The cited studies [43-45], like the DDSL controller, used data-driven QL approaches to solve their control requirements. The major differences between the individual approaches are the way the QL algorithm is used, the possible additional control methods, and the monitored subject matter. The advantages and limitations mentioned in the table are derived from these conditions. The proposed DDSL controller offers a unique approach to solving a data-driven self-learning principle for mobile monitoring embedded devices. The article proposed a modified QL-based algorithm which controls the operational cycle according to the acquired data. The general principle lies in observing the parameters of interest when the data from sensors contain high information value. This solution leads to the minimization of operational cycles when the data change according to a predictable trend. It offers a unique paradigm in contrast to the classic scenario, in which an embedded device obtains data and then decides whether the data contain information which should be stored and transmitted to a cloud. The presented DDSL method principally avoids redundant data acquisition, which leads to more energy-efficient operation.
The proposed DDSL algorithm provides better results than the reference algorithm, which operates with a constant measurement period. The novel approach described in this paper achieved an approximately 40% higher PS than the reference algorithm. This means that the novel algorithm reached a lower RMSE at the same NoM as the reference algorithm, or a lower NoM at the same RMSE.
The presented solution opens several research opportunities. The first challenge includes application of the proposed method in another data domain. The next research challenge might be modification of the learning model. It is also possible to use statistical parameters in the reward policy to replace the polynomial function. In this article, the authors examined the general principle of the DDSL approach, which performs well on the presented mobile monitoring embedded devices; however, future modifications of the DDSL approach could lead to more effective domain-customized solutions.