Optimizing Steering Angle Predictive Convolutional Neural Network for Autonomous Car

Abstract: Deep learning techniques, particularly convolutional neural networks (CNNs), have exhibited remarkable performance in solving vision-related problems, especially in unpredictable, dynamic, and challenging environments. In autonomous vehicles, imitation-learning-based steering angle prediction is viable due to the visual imagery comprehension of CNNs. In this regard, globally, researchers are currently focusing on the architectural design and optimization of the hyperparameters of CNNs to achieve the best results. Literature has proven the superiority of metaheuristic algorithms over the manual-tuning of CNNs. However, to the best of our knowledge, these techniques are yet to be applied to address the problem of imitation-learning-based steering angle prediction. Thus, in this study, we examine the application of the bat algorithm and particle swarm optimization algorithm for the optimization of the CNN model and its hyperparameters, which are employed to solve the steering angle prediction problem. To validate the performance of each hyperparameter set and architectural parameter set, we utilized the Udacity steering angle dataset and obtained the best results at the following hyperparameter set: optimizer, Adagrad; learning rate, 0.0052; and nonlinear activation function, exponential linear unit. As per our findings, deeper models show better results but require more training epochs and time as compared to shallower ones. Results show the superiority of our approach in optimizing CNNs through metaheuristic algorithms as compared with the manual-tuning approach. Infield testing was also performed using the model trained with the optimal architecture, which we developed using our approach.


Introduction
In recent years, we have witnessed a storm of advancements in autonomous self-driving ground vehicles, and significant research efforts in the industry and academia are being devoted to their successful implementation. In this regard, one of the challenges identified is the accurate prediction of the steering angle required for a vehicle to autonomously steer on a given terrain, due to the heterogeneity of roads and their geometries. A recent solution has been proposed to address these challenges in steering angle prediction, that is, by learning from demonstration, also known as imitation learning [1].
In autonomous vehicles, imitation learning is used to learn the steering angle to drive in different scenarios through human driving demonstration. For this purpose, corresponding steering angles are collected simultaneously while driving the vehicle, and these are used as training data for the supervised learning by an artificial neural network (ANN) model. That is, the steering angle is predicted by the ANN model using raw image pixels as the input [2][3][4]. Once the model learns, it can then autonomously predict steering angles without human intervention. For this type of supervised learning steering angle prediction problem, the convolutional neural network (CNN) and its variants are mostly employed due to their remarkable performance in visual imagery understanding [5]. However, the performance of these networks strongly depends on their architecture, design, and training parameters [6].
The process of designing the architecture of an ANN and tuning the hyperparameters to achieve optimal results for a particular problem has been identified to be strenuous and remains under investigation by researchers globally [7]. This becomes even more challenging when the dimensionality of the hyperparameter space increases. In particular, deep neural networks have different hyperparameters that need to be adjusted given any input dataset, giving rise to a high-dimensional search space. The primary hyperparameters of the CNN that contribute to the accuracy of results and speed of convergence include the learning rate, optimizer function, number of epochs, batch size, activation function, dropout rate, number and sequence of layers (convolutional (Conv), pooling, and fully connected (FC) layers), number of neurons in the FC layers, and number and size of the filters in each Conv layer. Each setting, having a specific combination of all these hyperparameters, has a different impact on the performance of the neural network. The process of tuning all these hyperparameters and evaluating the results for each setting is tedious, time-consuming, and computationally expensive [8].
Literature has proven the competence of many nature-inspired metaheuristic algorithms, including the genetic algorithm [9][10][11][12], swarm intelligence optimization algorithms [13][14][15][16][17][18], and their variants, for the optimization of hyperparameters. The optimization of the hyperparameters of a steering angle predictive neural network has been determined to be a nonlinear, non-convex, and complex global optimization problem. To solve this kind of problem, the bat algorithm and its variants have shown efficient results as compared to other metaheuristic algorithms in the previous studies [19][20][21]. On these grounds, the prospect of the bat algorithm for tuning the CNN can be envisaged. However, the algorithm has never been used in tuning the hyperparameters of CNN. Moreover, for the steering angle prediction, various architectures of neural networks have been proposed by researchers; most of them have manually modified the hyperparameters of CNN to determine the combination(s) of values that afford the best results [22,23]. However, no research has been conducted on the utilization of metaheuristic algorithms for the optimization of steering angle predictive CNN. This motivated us to explore the application of two metaheuristic algorithms in realizing optimal steering angle predictive CNNs. Our research is divided into two steps. In the first step, we optimize the learning rate, batch size, activation function, and optimizer using the bat algorithm. In the second step, one of the optimal settings of these hyperparameters is selected to optimize five architectural units of CNN using the bat algorithm, namely, the number of Conv layers, number of filters in each Conv layer, size of each filter, number of FC layers, and number of neurons in each FC layer. Moreover, these five architectural parameters are also optimized using the basic Particle Swarm Optimization (PSO) algorithm. 
This is performed to determine the effectiveness of the bat algorithm in solving the problem under discussion by comparing the results of the bat algorithm and the PSO algorithm. For this purpose, the same number of generations, population size, and encoding scheme of the CNN architecture are used. During the whole process of optimization in our methodology, we adopted early stopping, which involves stopping the training after some epochs if the model fails to improve. Finally, the optimal CNN model architecture is trained with more data and used in infield experiments.
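The early-stopping rule mentioned above can be illustrated with a minimal sketch. The function name, the `patience` value, and the list of per-epoch losses are hypothetical stand-ins chosen for illustration; the source does not state the patience that was used. In Keras, the `EarlyStopping` callback offers comparable behavior.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Scan per-epoch validation losses and stop once no improvement
    has been seen for `patience` consecutive epochs.

    `val_losses` stands in for the losses a real training loop would produce.
    Returns (best_epoch, best_loss, epochs_run), with epochs counted from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # Improvement: remember it and keep training.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch, best_loss, epochs_run
```

With a patience of 3, a run whose loss stalls after the third epoch is cut off three epochs later, which saves training time for every candidate model evaluated during the search.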
The main contributions of this study are as follows:
• Deploying bat and PSO algorithms for the automatic tuning of nine major architectural and training parameters of CNN, including the number of Conv layers, number of filters in each Conv layer, number of fully connected layers, number of nodes in the FC layers, batch size, dropout rate, activation function, optimizer, and learning rate.
• Providing CNN model architectures and hyperparameters with improved results for the steering angle prediction problem.
• Employing the Udacity dataset in evaluating the performance of the metaheuristic-based optimization approach.
• Validating the results through infield testing of our developed prototype.
The rest of the paper is structured as follows: Section 2 outlines the literature review regarding the steering angle prediction and metaheuristic optimization algorithm. Section 3 gives a brief overview of the architecture, properties, and hyperparameters of the CNN. The standard bat algorithm is discussed in Section 4. In Section 5, a short introduction to the PSO algorithm is provided. The methodology of implementing the bat algorithm and the PSO algorithm for the CNN architectural units and other hyperparameter selections is elucidated in Section 6. Experiments and results are discussed in Section 7. Lastly, Section 8 concludes our research.

Related Work
The study of steering angle prediction based on imitation learning began with the second Udacity competition related to the self-driving car. The winner of the competition developed a 9-layer CNN, which was determined to be successful in autonomously driving the vehicle in the Udacity simulator. Since then, continuous research has been carried out on designing the optimal neural network architecture for steering angle prediction and finding hyperparameters that provide the best training results. Among these neural networks, the most commonly used architecture found in the literature is CNN.
Lately, many architectures of CNN have been proposed by researchers for steering angle prediction [24][25][26][27][28][29]. By analysis of available literature regarding the imitation-learning-based steering angle prediction, it was observed that the performance of a neural network architecture proposed in one study cannot be compared with an architecture proposed in another study. This is because the training and testing datasets used in each study are different. This persists even with the studies in which the same publicly available datasets were used, because the dataset has not been divided into training and testing sets by the publisher, so the train-test split is different in different studies. Moreover, the evaluation metrics used in these studies vary from each other. Hence, there is an ambiguity in deciding which architecture of CNN and its hyperparameters should be used for training to utilize CNN for infield autonomous vehicles.
Recently, Kebria et al. [30] manually tuned 96 CNN models and equally trained them on a subset of the Udacity dataset to explore and evaluate the impact of three architectural parameters of CNN, including the number of layers, number of filters in each layer, and the filter size, on the performance of CNN for steering angle prediction. For this purpose, they incrementally increased the number of layers from 3 to 18, the number of filters from 4 to 128, and the filter sizes from 3 × 3 to 7 × 7, for designing CNN models. However, they did not perform infield testing to evaluate the performance of the best performing CNN model architecture obtained after the hand-tuning of architectures. Moreover, tuning CNN using a metaheuristic approach has proven its competence as compared to manually tuning it [31]. Jaddi et al. [32] have used a bat algorithm to optimize the architecture, as well as the weights and biases of simple feedforward neural networks. They tested the effectiveness of their approach using two time-series and six classification benchmark datasets. However, the process of tuning CNN is quite complex and challenging due to its disparate layers and dimensionality issues.

Convolutional Neural Network
In recent years, it has been established that CNNs can generate rich input image representations by embedding input images in fixed-length vectors, which can be used for a variety of visual tasks. However, CNN performance depends on datasets, architecture, and other training attributes [33,34]. The basic architecture of a CNN comprises input and output layers, as well as multiple hidden layers. The hidden layers of a CNN consist of a series of convolutional (Conv) layers, pooling layers, normalization layer(s), a flatten layer, and fully connected layer(s). Central to the CNN are the Conv layers, in which filters convolve over the image and perform a dot product with the image pixels. This operation is very important for indexes in a matrix as it affects how weights are determined at a particular index point. The output array obtained as a result of convolution is called a feature map. The number of feature maps yielded by a Conv layer equals the number of filters used in the layer. The number of parameters and output volume of a layer depend on the input size, filter size, number of filters, padding, and stride of filters. It is a convention to apply the activation layer immediately after each convolution layer. The objective of this layer is to incorporate nonlinearity into the system. Some commonly used nonlinear activation functions include the rectified linear unit (ReLU), Leaky ReLU, parametrized ReLU, ELU, SELU, sigmoid, softmax, and tanh. Softmax and sigmoid are commonly used in the output layer for classification problems.
The batch size, learning rate, number of epochs, and optimizer are among the crucial training parameters influencing the performance of a CNN architecture being trained on a given dataset, where one epoch indicates that the entire dataset has traversed forward and backward through the CNN once. The batch size is the total number of training examples present in a single batch when the whole dataset is divided into batches. The learning rate governs the step size at each iteration while moving toward a minimum of a loss function. The optimizer is responsible for altering the weights inside the CNN. The commonly used optimizers for CNN are stochastic gradient descent (SGD), Adam, Adagrad, Nesterov accelerated gradient (NAG), AdaDelta, and RMSProp.
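To make the dependence of the parameter count on the filter size and the number of filters concrete, the following is a small sketch. The function name and the example layer sizes are illustrative assumptions, not values taken from the models in this study.

```python
def conv_layer_params(in_channels, num_filters, filter_size, use_bias=True):
    """Trainable parameters of a 2-D Conv layer with square filters.

    Each filter holds filter_size * filter_size * in_channels weights,
    plus one bias term when use_bias is True.
    """
    weights = filter_size * filter_size * in_channels * num_filters
    biases = num_filters if use_bias else 0
    return weights + biases


# Example: a first Conv layer on an RGB image with 16 filters of size 3 x 3
# has 3 * 3 * 3 * 16 weights and 16 biases, and yields 16 feature maps.
first_layer = conv_layer_params(in_channels=3, num_filters=16, filter_size=3)
```

Moving from 3 × 3 to 7 × 7 filters multiplies the weight count of a layer by roughly 5.4, which is one reason the filter size is treated as a tunable architectural parameter.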
For training the ANN to perform steering angle prediction, the corresponding steering angles and images are collected while a car is being driven by a human driver. For this purpose, three methods are being used. In the first method, simulation software (such as CARSIM, CARLA, and TORCS) [35,36], is being used to generate the data sets required for this task. In the second method, images and corresponding steering angles are collected by a human-driven vehicle using onboard cameras and angle sensors [37,38]. Lastly, the third method involves utilizing the benchmark publicly available datasets (such as Udacity, DIPLECS, and Comma.ai) [39], for training ANNs to perform autonomous steering angle prediction.

Bat Algorithm
The bat algorithm is a metaheuristic optimization algorithm proposed by Yang [40], and it is based on the echolocation behavior of microbats. Bats are known to release a very loud sound pulse; subsequently, they find their prey based on the echoes returned from objects. Each pulse emitted has a specific frequency and loudness, based on which the velocity and position of the bat are adjusted. As the bat approaches its prey, the pulse rate increases, while the loudness decreases. In the bat algorithm, this bat mechanism for finding prey is imitated to determine the optimal solution. The bat algorithm involves a sequence of iterations, where the population of bats represents the candidate solutions, which, in turn, are updated using the frequency and velocity. The frequency, velocity, and position of the solutions are calculated based on the following equations:

$$\mathrm{freq}_i = \mathrm{freq}_{min} + (\mathrm{freq}_{max} - \mathrm{freq}_{min})\,\eta \quad (1)$$

$$\mathrm{vel}_i^{t+1} = \mathrm{vel}_i^{t} + (x_i^{t} - x_{gbest}^{t})\,\mathrm{freq}_i \quad (2)$$

$$x_i^{t+1} = x_i^{t} + \mathrm{vel}_i^{t+1} \quad (3)$$

where $\eta$ is a random number in the range of (0, 1); $\mathrm{freq}_i$ is the frequency of the $i$th bat that controls the range and speed of movement of the bats; $x_i$ and $\mathrm{vel}_i$ denote the position and velocity of the $i$th bat, respectively; and $x_{gbest}^{t}$ stands for the current global best position at time step $t$. The next step is to check if the random number is less than the pulse rate of the bat. If true, a random walk is applied around the best solutions through a local search using Eq. (4). This is performed to enhance the diversity of the solutions.

$$x_{new} = x_{old} + \alpha A^{t} \quad (4)$$

where $A^{t}$ denotes the average loudness of all bats so far and $\alpha \in [-1, 1]$ is a random number that controls the direction of the random walk. The pulse rate is observed to increase when a bat finds its prey, while the loudness typically decreases. The loudness ($A_i$) and pulse rate ($r_i$) are updated using the following equations:

$$A_i^{t+1} = \beta A_i^{t} \quad (5)$$

$$r_i^{t+1} = r_o \left[ 1 - \exp(-\mu t) \right] \quad (6)$$

where $r_o$ is the initial pulse rate, $r_i^{t+1}$ is the pulse rate of the bat computed for the next step, and $\beta$ and $\mu$ are constant values. The pulse rate and loudness of the bats are updated only if a new solution is accepted.
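The update rules described in this section can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in this study: the toy `sphere` objective stands in for the MSE of a trained CNN, and the bounds, seed, initial loudness, and initial pulse rate are assumptions. The local search is triggered when the random number is below the pulse rate, as stated in the text, and a candidate is accepted when it improves on the bat's own fitness, as in the standard formulation.

```python
import math
import random


def sphere(x):
    """Toy fitness (lower is better); stands in for the MSE of a trained CNN."""
    return sum(v * v for v in x)


def bat_search(fitness, dim=2, pop_size=10, n_gen=50, freq_min=0.0, freq_max=1.0,
               loudness0=1.0, pulse0=0.5, beta=0.9, mu=0.9,
               bounds=(-5.0, 5.0), seed=0):
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(v):
        return max(lo, min(hi, v))

    # Random initial positions, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    vel = [[0.0] * dim for _ in range(pop_size)]
    fit = [fitness(p) for p in pos]
    loud = [loudness0] * pop_size
    pulse = [pulse0] * pop_size
    best_i = min(range(pop_size), key=fit.__getitem__)
    best, best_fit = pos[best_i][:], fit[best_i]

    for t in range(1, n_gen + 1):
        mean_loud = sum(loud) / pop_size
        for i in range(pop_size):
            # Frequency, velocity, and position updates (Eqs. 1-3).
            freq = freq_min + (freq_max - freq_min) * rng.random()
            vel[i] = [v + (x - g) * freq for v, x, g in zip(vel[i], pos[i], best)]
            cand = [clip(x + v) for x, v in zip(pos[i], vel[i])]
            # Random walk around the best solution (Eq. 4).
            if rng.random() < pulse[i]:
                cand = [clip(g + rng.uniform(-1.0, 1.0) * mean_loud) for g in best]
            cand_fit = fitness(cand)
            # Accept only improving candidates, gated by the loudness.
            if rng.random() < loud[i] and cand_fit < fit[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] = beta * loud[i]                      # loudness decay (Eq. 5)
                pulse[i] = pulse0 * (1 - math.exp(-mu * t))   # pulse rate update (Eq. 6)
            if fit[i] < best_fit:
                best, best_fit = pos[i][:], fit[i]
    return best, best_fit
```

Calling `bat_search(sphere)` returns the best position found and its fitness. Because candidates are accepted only when they improve a bat's fitness, the returned fitness can never be worse than that of the best bat in the initial population.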

PSO Algorithm
The PSO algorithm is another metaheuristic-based optimization algorithm [41]. The optimal solution in PSO is searched based on the social behavior of fish schools and bird flocks. Each particle moves within the search space through collaboration with other particles while balancing exploration and exploitation. The PSO algorithm involves a sequence of iterations, where a population of particles represents candidate solutions, which are updated using the following equations:

$$\mathrm{vel}_i^{t+1} = \mathrm{vel}_i^{t} + c_1\,\mathrm{rand}_1\,(pbest_i - x_i^{t}) + c_2\,\mathrm{rand}_2\,(gbest - x_i^{t}) \quad (7)$$

$$x_i^{t+1} = x_i^{t} + \mathrm{vel}_i^{t+1} \quad (8)$$

where $\mathrm{vel}_i$ is the velocity of the $i$th particle, $c_1$ and $c_2$ are positive constants, $\mathrm{rand}_1$ and $\mathrm{rand}_2$ are random numbers, $pbest_i$ is the best previous position of the $i$th particle, $gbest$ is the global best position explored thus far in the entire swarm of particles, and $x_i$ is the current position of the $i$th particle.
Eq. (7) updates the velocity of the i th particle, which is used in Eq. (8) to update the position of the particle.
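A single application of the velocity and position updates can be written as a short function. This is a hedged sketch: the function name, the optional `v_max` clamp, and the assumption that the random numbers are drawn uniformly from [0, 1) per dimension are illustrative choices, not details confirmed by the source.

```python
import random


def pso_step(x, vel, pbest, gbest, c1=2.0, c2=2.0, v_max=None, rng=random):
    """Apply one basic PSO update to a single particle.

    x, vel, pbest, and gbest are equal-length lists of floats.
    Returns the new position and the new velocity.
    """
    new_x, new_vel = [], []
    for xi, vi, pi, gi in zip(x, vel, pbest, gbest):
        # Velocity update: pull toward the personal and global bests (Eq. 7).
        v = vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
        if v_max is not None:
            v = max(-v_max, min(v_max, v))  # optional velocity bound
        new_vel.append(v)
        new_x.append(xi + v)                # position update (Eq. 8)
    return new_x, new_vel
```

A particle that already sits on both its personal best and the global best with zero velocity does not move, since both attraction terms vanish.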

Methodology and Experiments
In this section, we present the methodology we adopted for the optimization of CNN using metaheuristic algorithms. In the following subsections, we describe the solution representation or encoding scheme, hyperparameter tuning of the CNN, and architecture optimization of the CNN. To verify the performance of the metaheuristic algorithms for automatically setting the architecture of CNN, we used a subset (2400 data samples for training and 600 for testing) of Udacity's publicly available dataset for steering angle prediction [42]. This dataset consists of images and corresponding steering angles in the range of −2 to 2. Because, in imitation-learning-based steering angle prediction, the neural network model learns to predict the steering angle from the shape of the road, images with no visible road may cause it to learn wrong features. Hence, to obtain more robust results, we removed the frames in the dataset that were captured when the car was parked and when the camera view only covered other vehicles instead of any road. After this step, the images were cropped: one third of each frame was cropped from the top in order to remove unwanted data, and the result was rescaled to 300 × 300. Some samples of the Udacity dataset after the cropping and rescaling operation are shown in Fig. 1a, and samples that were removed because they did not contain useful road area information are shown in Fig. 1b. Images and corresponding steering angles are provided to the CNN as input; once the model is trained, it provides the steering angle as output based on the input image. To evaluate the performance of the solution at each step of our methodology, an effective evaluation function was required to select the optimal solution. In our methodology, the function to evaluate the fitness of a solution is the mean squared error (MSE) of the model.
System parameters used for the optimization process include Keras with Tensorflow as Backend, 32 GB RAM, Core i9 CPU @ 5.20 GHz, and GTX 2080 GPU.
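The preprocessing and the fitness function described above can be sketched in dependency-free Python. Frames are represented here as nested lists (rows of pixels) purely for illustration, and nearest-neighbor resampling is an assumption; the interpolation method used in the study is not stated, and in practice a library routine such as OpenCV's `cv2.resize` would typically be used.

```python
def crop_top_third(frame):
    """Drop the top third of the rows of a frame given as a list of rows."""
    return frame[len(frame) // 3:]


def resize_nearest(frame, out_h=300, out_w=300):
    """Nearest-neighbor rescale of a frame to out_h x out_w."""
    h, w = len(frame), len(frame[0])
    return [[frame[r * h // out_h][c * w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def preprocess(frame):
    """Crop the top third, then rescale to 300 x 300."""
    return resize_nearest(crop_top_third(frame))


def mse(y_true, y_pred):
    """Mean squared error between true and predicted steering angles."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
```

For a 640 × 480 frame, cropping leaves 320 rows before the rescale to 300 × 300. A lower `mse` value indicates a fitter candidate during the optimization.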

Optimizing the Hyperparameters of the CNN Using the Bat Algorithm
In this subsection, the procedure of employing the bat algorithm for the hyperparameter optimization of CNN is discussed. In Algorithm 1, an overview of our proposed bat algorithm is presented.

Algorithm 1:
19) Generate a local solution around the selected solution in step 18 using Eq. (4)
20) endif
21) if the currently generated solution has not been evaluated in the past, then
22) if rand2 < A_i, then
23) Train the CNN model with the current solution and evaluate its fitness (x_i)
24) if the fitness (x_i) < fitness (xbest), then
25) Increase r_i and reduce A_i using Eqs. (5) and (6)
26) Compare fitness (x_i) & fitness (xbest), and find the current best
27) Update xbest
28) endif
29) endif
30) endfor

The selected parameters for this step of optimization include A_i = 0.3, freq_min = 0, freq_max = 1, β = 0.5, and μ = 0.5. In our approach, the pulse rate of each bat is initialized in increasing order with respect to the fitness of each bat. This is done to reduce the probability that solutions with better fitness perform random local searches around other top solutions. This strategy may help reduce the probability of early convergence, which is the main limitation of the bat algorithm. We found through experiments that small changes to the parameters in the vicinity of better-fit CNN models give improved results. Therefore, initializing the pulse rate in increasing order with respect to fitness lowers the prospect of overlooking the problem space in the vicinity of better-fit models. The process of optimization starts by initializing a population of 10 bats; thereafter, the position of each bat is adjusted based on the velocity and frequency, which are updated at each iteration using the standard bat algorithm. The position of each bat represents the hyperparameter setting for a CNN model, i.e., a real-valued vector representing the batch size, activation function, learning rate, and optimizer.
The activation function and optimizer are then mapped to real numbers, after which all the dimensions of each bat are encoded in the same range. Tab. 1 presents the search space for each dimension of the bats. At the initialization step, the position of each bat is assigned randomly, and the velocity of each bat is initialized as 0. We then initialize the pulse rate and loudness after evaluating the fitness of the position of each bat. The performance of each set of hyperparameters proposed by the bat algorithm was then validated by the MSE of a fixed CNN architecture. For this purpose, we selected the top-performing architecture from an existing work [30] (we name it 'M1') to optimize four hyperparameters of CNN, namely, the batch size, learning rate, activation function, and optimizer. After 20 iterations on a population of 10 bats, the top 4 high-performing hyperparameter settings obtained at the last iteration are depicted in Tab. 2. For the next experiments of the CNN architecture optimization, we selected the second-best hyperparameter setting instead of the best setting, as it requires a relatively low number of epochs to provide satisfactory results.
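One way to decode a real-valued bat position into a hyperparameter setting, and to skip solutions that were already evaluated, is sketched below. The candidate lists, the numeric ranges, and the common [0, 1] encoding are hypothetical placeholders; the actual search space is the one given in Tab. 1, which is not reproduced here. The `train_and_score` argument is likewise a stand-in for training the CNN and returning its MSE.

```python
# Hypothetical search space, for illustration only.
ACTIVATIONS = ["relu", "elu", "selu", "tanh"]
OPTIMIZERS = ["sgd", "adam", "adagrad", "rmsprop"]
BATCH_RANGE = (8, 64)
LR_RANGE = (1e-4, 1e-2)


def _pick(options, u):
    """Map u in [0, 1] to one of the categorical options."""
    idx = min(int(u * len(options)), len(options) - 1)
    return options[idx]


def decode(position):
    """Map a position (batch size, activation, learning rate, optimizer),
    with every dimension encoded in [0, 1], to a concrete setting."""
    u = [min(1.0, max(0.0, v)) for v in position]
    batch = round(BATCH_RANGE[0] + u[0] * (BATCH_RANGE[1] - BATCH_RANGE[0]))
    lr = LR_RANGE[0] + u[2] * (LR_RANGE[1] - LR_RANGE[0])
    return (batch, _pick(ACTIVATIONS, u[1]), round(lr, 6), _pick(OPTIMIZERS, u[3]))


_cache = {}


def cached_fitness(position, train_and_score):
    """Train only if this decoded setting has not been evaluated before."""
    key = decode(position)
    if key not in _cache:
        _cache[key] = train_and_score(key)
    return _cache[key]
```

Caching on the decoded setting rather than on the raw position means that two nearby positions that map to the same configuration trigger only one training run, which matters when each evaluation is a full CNN training.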

Optimizing the CNN Architecture Using the Bat Algorithm
In this subsection, the optimization of five architectural properties using the bat algorithm is explained, including the number of Conv layers, number of filters in each Conv layer, filter sizes in each layer, number of FC layers, and number of neurons in each FC layer; the ranges of these parameters in our approach are (3-13), (16-128), (3 × 3-7 × 7), (1-5), and (10-120), respectively. Algorithm 1 is adopted to optimize the CNN architecture, with pop_size = 10, numofgen = 20, freq_min = 1, freq_max = 4, initial loudness = 4, dimension (D) = 5, β = 0.9, and μ = 0.9; the pulse rate is initialized in ascending order with respect to the fitness of the bats. Position x_i of each bat represents a particular CNN architecture, and each dimension of the bat corresponds to a CNN architectural property. At the initialization step, position x_i of each bat is initialized randomly in the range specified for the respective dimension. Subsequently, each dimension is rescaled to the range of 10-250 for further processing in the next iterations. The velocity of each bat is initialized as 0, after which it is adjusted using Eq. (2) as mentioned in Section 4 of this paper. The velocity is bound to the range of −18 to 18.
When we automate the architecture formation of the CNN, we can encounter dimensionality problems. If the output feature map is not padded, its dimensions shrink at each layer by an amount that depends on the size of the Conv filters. Alternatively, if the layers are padded so that the dimensionality of the input is preserved in the output feature map, the CNN architecture will not fit in the memory for training as the number of layers increases. Therefore, before connecting the FC layers, it is important to reduce the dimensionality. We have solved these dimensionality problems using the strategy depicted in Algorithm 2.
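The rescaling of each architectural dimension to the common 10-250 range and the velocity bound can be illustrated as follows. The linear mapping, the rounding to integers, and the snapping of the filter size to an odd value are assumptions made for this sketch; the source states only the parameter ranges, the common range, and the velocity bound.

```python
# Parameter ranges from the text; the dictionary keys are illustrative names.
RANGES = {
    "conv_layers": (3, 13),
    "filters": (16, 128),
    "filter_size": (3, 7),
    "fc_layers": (1, 5),
    "fc_neurons": (10, 120),
}
ENC_LO, ENC_HI = 10.0, 250.0   # common encoded range
V_MAX = 18.0                   # velocity bound


def encode(value, lo, hi):
    """Linearly map a native value in [lo, hi] to the encoded range."""
    return ENC_LO + (value - lo) * (ENC_HI - ENC_LO) / (hi - lo)


def decode(code, lo, hi):
    """Inverse of encode, after clipping the code to the encoded range."""
    code = min(ENC_HI, max(ENC_LO, code))
    return lo + (code - ENC_LO) * (hi - lo) / (ENC_HI - ENC_LO)


def clamp_velocity(v):
    return max(-V_MAX, min(V_MAX, v))


def decode_architecture(position):
    """Turn a 5-dimensional encoded position into integer architecture values."""
    raw = {name: decode(code, lo, hi)
           for code, (name, (lo, hi)) in zip(position, RANGES.items())}
    arch = {name: int(round(v)) for name, v in raw.items()}
    # Assumption: filter sizes are restricted to the odd values 3, 5, and 7.
    arch["filter_size"] = min((3, 5, 7), key=lambda s: abs(s - raw["filter_size"]))
    return arch
```

Encoding all dimensions in one range lets a single velocity bound act uniformly across parameters whose native ranges differ by more than an order of magnitude.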

Algorithm 2: Procedure for reducing the dimensionality
1) if the number of layers <= 5
2) for the total number of Conv layers, do
3) apply the Conv filters on the activation map returned by the previous layer of CNN
4) apply a max-pooling layer with filter size 2 × 2
5) end for
6) else
7) for the total number of Conv layers, do
8) apply the Conv filters on the activation map returned by the previous layer of CNN
9) if the layer number is a multiple of 3
10) apply max-pooling layer with filter size 2 × 2
11) end if
12) end for
13) end if
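The pooling placement rule of Algorithm 2 can be expressed compactly in Python. The layer labels are placeholder strings rather than real framework layers, and the size calculation assumes that the Conv layers preserve the spatial size and that each 2 × 2 pooling halves it with rounding down.

```python
def build_conv_stack(num_conv_layers):
    """Return the Conv/pooling sequence prescribed by Algorithm 2."""
    layers = []
    for layer_no in range(1, num_conv_layers + 1):
        layers.append("conv")
        # Pool after every Conv layer for shallow stacks,
        # otherwise only after every third Conv layer.
        if num_conv_layers <= 5 or layer_no % 3 == 0:
            layers.append("maxpool2x2")
    return layers


def spatial_size(num_conv_layers, size=300):
    """Feature map side length after the stack, under the stated assumptions."""
    for layer in build_conv_stack(num_conv_layers):
        if layer == "maxpool2x2":
            size //= 2
    return size
```

Under these assumptions, a 13-layer stack pools four times and leaves an 18 × 18 feature map from a 300 × 300 input, whereas pooling after each of the 13 layers would collapse the map long before the last Conv layer.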

Optimizing the CNN Architecture Using the PSO Algorithm
In this subsection, the optimization of five architectural properties using the PSO algorithm is explained, including the number of Conv layers, number of filters in each Conv layer, filter sizes in each layer, number of FC layers, and the number of neurons in each FC layer. Position x_i of each particle represents a particular CNN architecture, and each dimension of the particle corresponds to a CNN architectural property. At the initialization step, position x_i of each particle is initialized to the same values as the first population of the bat algorithm in the previous step. This is carried out to ensure the same average fitness of individuals at the first iteration of the PSO and bat algorithms. After the initialization of the particle positions, each dimension is rescaled to the range of 10-250 for further processing in the next iterations. The velocity of each particle is initialized as 0, and it is then adjusted using Eq. (7). The velocity is bound to the range of −18 to 18. Most researchers obtained optimal results with a value of 2 for c1 and c2 [43,44]. Therefore, in our methodology, this value is selected for the PSO algorithm. The encoding scheme for each position of particles is the same as that used in the bat algorithm. Each model is trained for 50 epochs using the second-best hyperparameter setting obtained in step 1 of our methodology implemented in Section 6.1. Algorithm 3 presents the methodology of the application of the PSO algorithm for the architecture optimization of CNN.

Algorithm 3: Proposed PSO-algorithm based CNN architecture optimization algorithm
1) Set the number of generations, numOfgen ← 20
2) Set the population size, pop_size ← 10
3) Set the number of dimensions of each particle, D ← 4
4) Set c1 ← 2
5) Set c2 ← 2
6) Set the upper and lower boundaries of each dimension of the particles
7) Set the initial population
8) Train the CNN models with the hyperparameters suggested in the initial population
9) Evaluate the fitness of all the CNN models trained in the previous step
10) Set the best particle gbest based on the fitness
11) Set the initial velocity of the population, v_0 ← 0
12) for t in the range of pop_size, do:
13) Generate new solutions by adjusting the velocity (v_i) and position (x_i) of the i th particle using Eqs. (7) and (8)
14) Train the CNN model with the current solution, and evaluate its fitness
15) if the current fitness is better than the pbest fitness, then
16) Set the pbest position as the current position of the particle
17) Set the pbest fitness as the current fitness of the particle
18) end if
19) if the current fitness is better than the global best fitness, then
20) Set the gbest position as the current position of the particle
21) Set the gbest fitness as the current fitness of the particle
22) end if
23) end for
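The bookkeeping of personal and global bests in Algorithm 3 can be sketched as a complete loop. This is an illustration under stated assumptions: the `fitness` callable stands in for training a CNN and returning its MSE, the loop is written with an explicit outer iteration over generations and an inner iteration over particles, and positions are clipped to the encoded range. None of these details is confirmed by the listing itself.

```python
import random


def pso_optimize(fitness, init_positions, n_gen=20, c1=2.0, c2=2.0,
                 bounds=(10.0, 250.0), v_max=18.0, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [list(p) for p in init_positions]     # shared initial population
    vel = [[0.0] * len(p) for p in pos]         # zero initial velocity
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(len(pos)), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_gen):
        for i in range(len(pos)):
            for d in range(len(pos[i])):
                v = (vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))               # velocity bound
                pos[i][d] = max(lo, min(hi, pos[i][d] + vel[i][d]))  # stay in range
            f = fitness(pos[i])
            if f < pbest_fit[i]:                 # update the personal best
                pbest[i], pbest_fit[i] = pos[i][:], f
            if f < gbest_fit:                    # update the global best
                gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit
```

Passing the same `init_positions` to both optimizers mirrors the shared initial population described in the text, so that differences in the results reflect the search strategy rather than the starting point.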

Infield Experiments
For the steering angle predictive neural network to be implemented practically, it needs to be robust. One way of achieving robustness is to train the CNN model with an extensive dataset containing various scenarios. To achieve this, the CNN model with the least MSE obtained after the optimization was retrained with more data (i.e., 25000 more data samples). The top hyperparameter setting obtained in Section 6.1 was used for training the CNN model. Afterward, infield testing was performed using the trained model to drive the vehicle on a road. The camera was mounted on the front, aligned at the exact center of the vehicle.
The Python library OpenCV was used to capture live frames (of dimension 640 × 480), which were continuously fed to the trained CNN model for the prediction of the steering angle. The dimensions of the training and testing frames should be the same. Therefore, one-third of each frame from the live feed is cropped from the top, and the result is rescaled to 300 × 300. Fig. 3 shows some examples of the original and equivalently cropped/rescaled testing and training frames. The average prediction time taken by our model on live camera frames was 0.1 s. The model gave continuous values within the range of −2 to 2, with 2 as a command to steer toward the extreme left and −2 as a command to steer toward the extreme right. Values close to 0, e.g., 0.0052, −0.00093, and 0.0102, are identified as commands to drive straight. The predicted steering angle is sent to an Arduino ESP 32 DEVKIT V1, which controls the rotation of the wheels with the help of a rotational potentiometer. The experiments were conducted for a total of 1 h of driving at different times of the day and different positions of the vehicle on the road. We found that the best prediction results of the model were obtained when the vehicle was in the middle of the road. However, it did not perform well near the boundaries of the road, particularly at the right margins of the road. Such wrong predictions might be because, during the collection of the dataset, the vehicle was mostly driven at the center of the road. As the width of the road, lane marking color, and other properties of the testing track are different from those of the training track, the ability of the CNN model to generalize to new scenarios is convincing. In Fig. 4, some of the scenarios with correct as well as wrong prediction results are shown.
This section provides details of our methodology and findings. We optimized the CNN via a two-step process. In the first step, we tuned four hyperparameters, and, in the second step, we optimized five architectural properties of the CNN. We began our experiment by tuning four hyperparameters for which a fixed CNN model was employed. Thereafter, the effectiveness of the tuned hyperparameter settings for other CNN architectures was verified by training another model (M2).
In the solution vector of the best bat in step 1 of our approach, we obtained a 0.00052 learning rate for the Adagrad optimizer. This is then used as the initial learning rate by the optimizer and is updated for every parameter based on the past gradients. We observed that the Adagrad optimizer consumed more training time and required a higher number of epochs as compared to those of Adam, which can be observed in Tab. 4. Therefore, for the experiments of the CNN architecture optimization, we selected the second-best hyperparameter setting instead of the top-most setting because it provides a satisfactory estimate of the performance of a model with fewer epochs. For the comparison of the PSO and Bat algorithms for the CNN architecture optimization, the average fitness of the individuals in a population is plotted at each iteration (see Fig. 5).      [30] 0.0033 M3 [1] 0.0030 It is useful to observe the cumulative effect of several layers of the CNN model on the performance of the model. Hence, we have plotted the MSE of models obtained through both optimization algorithms concerning the total count of layers (see Fig. 6).  After the process of training, each CNN model in our approach takes 300 x 300 RGB images as input and produces a real number as output in the range of −2 to 2. This number is the steering angle predicted by CNN with respect to a given image. Top CNN models obtained in our approach with the least MSEs have been depicted in Tab. 3.
For each model in Tab. 3, the number of Conv filters in each Conv layer is given, followed by the filter size. The filter size of each max-pooling layer is also given, followed by the word "Pooling". Each architecture has a flatten layer between the Conv and FC layers. The number of neurons in each FC layer is given right next to it. Moreover, each model ends with an FC layer with one neuron that serves as the output layer. A dropout layer is added before the flatten layer in each model. We used the "ELU" activation function for every layer except the last FC layer, where a linear activation function was used. For further reference, the models in Tab. 3 are named B1, B2, up to B7.
Because a pooling layer with filter size 2 × 2 halves each dimension of every feature map, we cannot apply a pooling layer after each Conv layer. The input image size used in our approach is 300 × 300, so after seven applications of the pooling layer, each feature map would shrink to 2 × 2, whereas the maximum number of Conv layers used in our approach is 13. Therefore, we adopted the pooling strategy depicted in Algorithm 3. According to this strategy, if the number of Conv layers proposed by either the Bat or the PSO algorithm in the current solution vector is less than or equal to 5, max pooling is applied after each Conv layer; otherwise, it is applied after every two Conv layers. Another precautionary measure taken in our approach to avoid dimensionality problems is zero padding of the Conv layers. The output size of a Conv layer is determined by the formula

$$\text{Output size} = \frac{W - F_{size} + 2P}{S} + 1,$$

where W is the input volume, F_size is the filter size, P is the padding, and S is the stride. In our case, a stride of 1 is applied throughout the experiments. We applied different paddings to the layers depending on the filter size. That is, a padding of 2 × 2 is applied if the filter size is 3 × 3, 3 × 3 if the filter size is 5 × 5, and 5 × 5 if the filter size is 7 × 7. Therefore, the output size of a Conv layer in our approach remains the same as its input size.
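The dimension arithmetic above can be checked with a short sketch. It makes several assumptions: the function names are illustrative, pooling uses floor division for odd sizes, and for more than five Conv layers pooling follows layers 2, 4, 6, and so on, since Algorithm 3 is not reproduced here. The per-side padding computed by `same_padding`, (F − 1)/2, is the value the standard output-size formula implies for stride 1; how it maps onto the padding sizes quoted above depends on a padding convention that the text does not spell out.

```python
def conv_output_size(w, f, p, s=1):
    """Spatial output size of a Conv layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


def same_padding(f):
    """Per-side zero padding that preserves the size for stride 1 and odd F."""
    return (f - 1) // 2


def pooling_positions(num_conv):
    """1-based indices of the Conv layers that are followed by 2x2 max pooling.

    Assumed reading of the strategy: pool after every Conv layer when there
    are at most 5 of them, otherwise after every second Conv layer.
    """
    step = 1 if num_conv <= 5 else 2
    return list(range(step, num_conv + 1, step))


def feature_map_size(input_size, num_conv):
    """Feature-map side length after all pooling layers (floor halving)."""
    size = input_size
    for _ in pooling_positions(num_conv):
        size //= 2
    return size


for f in (3, 5, 7):
    print(f, conv_output_size(300, f, same_padding(f)))
for n in (5, 13):
    print(n, pooling_positions(n), feature_map_size(300, n))
```

Under these assumptions, a 13-layer Conv stack receives six pooling layers rather than thirteen, so the feature maps never collapse to the 2 × 2 size that seven or more halvings of a 300 × 300 input would produce.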
Although Fig. 5 shows that the models were still improving in the last epochs with the Adagrad optimizer, without overfitting, we trained all the models for only 50 epochs during the optimization because of time constraints. Improved performance could be obtained by training for more epochs with the Adagrad optimizer. Therefore, to revalidate our results, we trained the best-performing model architecture with the Adagrad optimizer for 200 epochs. The results are then compared with existing hand-tuned steering angle predictive CNN models (M1 and M3) from previous studies [1,30]. Model M1 has already been defined in Section 6.1; M3 is a 9-layer CNN model containing 5 Conv and 4 FC layers. Tab. 4 compares the two best-performing models obtained in our experiments (B1 and B2) with the best-performing models of the previous studies (M1 and M3).

Conclusions
Deep learning models, as opposed to traditional algorithms built on preformulated rules, are efficient in representing the relationship between a response and its predictors. However, the problem space containing all possible combinations of real values for each parameter of these models is vast; thus, it becomes infeasible to explore with exhaustive search as the number of layers increases. Therefore, we applied metaheuristic algorithms to tune the CNN architectural parameters and other hyperparameters. From our findings, we conclude that CNN models with a high number of layers perform better than models with a low number of layers. This may not necessarily hold for more than 13 Conv layers or more than 5 FC layers, because these are the maximum values we selected for our methodology. The evaluation criterion, i.e., the MSE of the best CNN architecture obtained at the end of all iterations, demonstrates the effectiveness of our approach in optimizing the CNN architecture and other hyperparameters using metaheuristic algorithms. We found during our experimentation that, even though deep models require more training time and epochs, they provide better results than shallow ones. In the future, we plan to collect our own dataset and compare the performance of the CNN model trained on the Udacity dataset with that of the same model trained on our own dataset. This comparison will be carried out through infield testing on the same vehicle in different scenarios.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.