Deep learning techniques, particularly convolutional neural networks (CNNs), have exhibited remarkable performance in solving vision-related problems, especially in unpredictable, dynamic, and challenging environments. In autonomous vehicles, imitation-learning-based steering angle prediction is viable because CNNs can comprehend visual imagery. Accordingly, researchers worldwide are currently focusing on the architectural design and hyperparameter optimization of CNNs to achieve the best results. The literature has established the superiority of metaheuristic algorithms over manual tuning of CNNs. However, to the best of our knowledge, these techniques have not yet been applied to imitation-learning-based steering angle prediction. Thus, in this study, we examine the application of the bat algorithm and the particle swarm optimization (PSO) algorithm for optimizing the CNN model and its hyperparameters for the steering angle prediction problem. To validate the performance of each hyperparameter set and architectural parameter set, we utilized the Udacity steering angle dataset and obtained the best results with the following hyperparameter set: optimizer, Adagrad; learning rate, 0.00052; and nonlinear activation function, exponential linear unit (ELU). We determined that deeper models yield better results but require more training epochs and time than shallower ones. The results show the superiority of optimizing CNNs through metaheuristic algorithms over the manual-tuning approach. Infield testing was also performed using the model trained with the optimal architecture developed through our approach.

In recent years, we have witnessed a storm of advancements in autonomous self-driving ground vehicles, and significant research efforts in industry and academia are being devoted to their successful implementation. One of the identified challenges is the accurate prediction of the steering angle required for a vehicle to steer autonomously on a given terrain, owing to the heterogeneity of roads and their geometries. A recently proposed solution to this challenge is learning from demonstration, also known as imitation learning [

In autonomous vehicles, imitation learning is used to learn the steering angle to drive in different scenarios through human driving demonstration. For this purpose, corresponding steering angles are collected simultaneously while driving the vehicle, and these are used as training data for the supervised learning by an artificial neural network (ANN) model. That is, the steering angle is predicted by the ANN model using raw image pixels as the input [

The process of designing the architecture of an ANN and tuning its hyperparameters to achieve optimal results for a particular problem has proven strenuous and remains under investigation by researchers globally [

Literature has proven the competence of many nature-inspired metaheuristic algorithms, including the genetic algorithm [

The main contributions of this study are as follows:

Deploying the bat and PSO algorithms for the automatic tuning of nine major architectural and training parameters of the CNN, including the number of Conv layers, the number of filters in each Conv layer, the number of fully connected (FC) layers, the number of nodes in the FC layers, the batch size, the dropout rate, the activation function, the optimizer, and the learning rate.

Providing CNN model architectures and hyperparameters with improved results for the steering angle prediction problem.

Employing the Udacity dataset in evaluating the performance of the metaheuristic-based optimization approach.

Validation of the results through infield testing of our developed prototype.

The rest of the paper is structured as follows: Section 2 outlines the literature review regarding the steering angle prediction and metaheuristic optimization algorithm. Section 3 gives a brief overview of the architecture, properties, and hyperparameters of the CNN. The standard bat algorithm is discussed in Section 4. In Section 5, a short introduction to the PSO algorithm is provided. The methodology of implementing the bat algorithm and the PSO algorithm for the CNN architectural units and other hyperparameter selections is elucidated in Section 6. Experiments and results are discussed in Section 7. Lastly, Section 8 concludes our research.

The study of steering angle prediction based on imitation learning began with the second Udacity self-driving-car competition. The winner of the competition developed a 9-layered CNN, which successfully drove the vehicle autonomously in the Udacity simulator. Since then, continuous research has been carried out on designing the optimal neural network architecture for steering angle prediction and on finding hyperparameters that provide the best training results. Among these neural networks, the most commonly used architecture in the literature is the CNN.

Lately, many architectures of CNN have been proposed by researchers for steering angle prediction [

Recently, Kebria et al. [

In recent years, it has been established that CNNs can generate rich input image representations by embedding input images in fixed-length vectors, which can be used for a variety of visual tasks. However, CNN performance depends on datasets, architecture, and other training attributes [

The batch size, learning rate, number of epochs, and optimizer are among the crucial training parameters influencing the performance of a CNN architecture being trained on a given dataset, where one epoch indicates that the entire dataset has traversed forward and backward through the CNN once. The batch size is the total number of training examples present in a single batch when the whole dataset is divided into batches. The learning rate governs the step size at each iteration while moving toward a minimum of a loss function. The optimizer is responsible for altering the weights inside the CNN. The commonly used optimizers for CNN are the stochastic gradient descent (Sgd), Adam, Adagrad, Nesterov accelerated gradient (NAG), AdaDelta, and RMSProp.

For training the ANN to perform steering angle prediction, the corresponding steering angles and images are collected while a car is being driven by a human driver. For this purpose, three methods are being used. In the first method, simulation software (such as CARSIM, CARLA, and TORCS) [

The bat algorithm is a metaheuristic optimization algorithm proposed by Yang [. Each i-th bat has a position x_i, a velocity v_i, and a frequency f_i, which are updated at iteration t as f_i = f_min + (f_max − f_min)β, v_i^t = v_i^(t−1) + (x_i^(t−1) − x_gbest)f_i, and x_i^t = x_i^(t−1) + v_i^t, where β is a random number in [0, 1] and x_gbest is the current global best position. As the search proceeds, the loudness is decreased as A_i^(t+1) = αA_i^t, and the pulse emission rate is increased as r_i^(t+1) = r_i^0[1 − exp(−γt)], where r_i^0 is the initial pulse rate and α and γ are constants.
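Assuming the standard formulation of the bat-algorithm update rules (frequency, velocity, position, loudness, and pulse rate), a pure-Python sketch with our own variable names could look as follows:

```python
import math
import random

def bat_step(x, v, x_gbest, f_min=0.0, f_max=2.0):
    """One frequency/velocity/position update for a single bat:
    f = f_min + (f_max - f_min) * beta, with beta drawn uniformly."""
    beta = random.random()
    f = f_min + (f_max - f_min) * beta
    v = [vd + (xd - gd) * f for vd, xd, gd in zip(v, x, x_gbest)]
    x = [xd + vd for xd, vd in zip(x, v)]
    return x, v, f

def pulse_rate(r0, gamma, t):
    """Pulse emission rate after t iterations: r0 * (1 - exp(-gamma * t))."""
    return r0 * (1.0 - math.exp(-gamma * t))

def loudness(A0, alpha, t):
    """Loudness decays geometrically: alpha^t * A0."""
    return (alpha ** t) * A0
```

In the hyperparameter-tuning setting of this paper, each position vector x would encode one candidate hyperparameter set.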

The PSO algorithm is another metaheuristic-based optimization algorithm [. Each i-th particle maintains a position x_i and a velocity v_i, which are updated as v_i^(t+1) = wv_i^t + c_1 rand_1(pbest_i − x_i^t) + c_2 rand_2(gbest − x_i^t) and x_i^(t+1) = x_i^t + v_i^(t+1), where rand_1 and rand_2 are random numbers, pbest_i is the best previous position of the i-th particle, gbest is the best position found by the swarm so far, w is the inertia weight, and c_1 and c_2 are acceleration coefficients.
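A pure-Python sketch of the standard single-particle PSO update (our own naming; the inertia weight and acceleration coefficients below are common textbook defaults, not values from the study):

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO velocity/position update for a single particle:
    v <- w*v + c1*rand1*(pbest - x) + c2*rand2*(gbest - x); x <- x + v."""
    r1, r2 = random.random(), random.random()
    v = [w * vd + c1 * r1 * (pb - xd) + c2 * r2 * (gb - xd)
         for vd, xd, pb, gb in zip(v, x, pbest, gbest)]
    x = [xd + vd for xd, vd in zip(x, v)]
    return x, v
```

When a particle already sits at both its personal best and the global best, the attraction terms vanish and only the inertia term w*v remains, so the velocity simply decays.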


In this section, we present the methodology we adopted for the optimization of CNN using metaheuristic algorithms. In the following subsections, we describe the solution representation or encoding scheme, hyperparameter tuning of the CNN, and architecture optimization of the CNN. To verify the performance of the metaheuristic algorithms for automatically setting the architecture of CNN, we used a subset (2400 data samples for training and 600 for testing) of Udacity's publicly available dataset for the steering angle prediction [

Images and their corresponding steering angles are provided to the CNN during training; once trained, the model outputs a steering angle for each input image. To evaluate the performance of a solution at each step of our methodology, an effective evaluation function was required to select the optimal solution. In our methodology, the function used to evaluate the fitness of a solution is the mean squared error (MSE) of the model. The system used for the optimization process comprised Keras with TensorFlow as the backend, 32 GB RAM, a Core i9 CPU @ 5.20 GHz, and a GTX 2080 GPU.
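The fitness function itself is straightforward; a minimal illustrative sketch of MSE over recorded versus predicted steering angles:

```python
def mse(y_true, y_pred):
    """Fitness used to rank candidate solutions: mean squared error
    between recorded and predicted steering angles."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# e.g., ground-truth vs. predicted steering angles in [-2, 2]
error = mse([0.1, -0.5, 0.0], [0.2, -0.4, 0.1])
```

Lower MSE means a better candidate architecture or hyperparameter set.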

In this subsection, the procedure of employing the bat algorithm for the hyperparameter optimization for CNN is discussed. In Algorithm

The parameters selected for this step of the optimization include the loudness A_i, the pulse rate r_i, and the frequency bounds f_min and f_max.

S. No | Hyperparameters | Ranges
---|---|---
1 | Batch size | 1, 8, 16, 32, 64, 128
2 | Learning rate | 0.01–0.000001
3 | Activation functions | ReLU, ELU, Leaky-ReLU, SELU, Sigmoid
4 | Optimizers | SGD, Adam, Adagrad, NAG, AdaDelta, RMSProp
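One plausible way to map a metaheuristic's continuous position vector onto the discrete ranges in the table above is to index into each candidate list. The decoding below is our own illustrative sketch, not necessarily the exact encoding used in the study (e.g., the learning rate is interpolated linearly here, though a log scale would also be reasonable):

```python
BATCH_SIZES = [1, 8, 16, 32, 64, 128]
ACTIVATIONS = ["relu", "elu", "leaky_relu", "selu", "sigmoid"]
OPTIMIZERS = ["sgd", "adam", "adagrad", "nag", "adadelta", "rmsprop"]
LR_MIN, LR_MAX = 1e-6, 1e-2

def decode(position):
    """position: 4 floats in [0, 1) -> one concrete hyperparameter set."""
    b, lr, act, opt = position
    return {
        "batch_size": BATCH_SIZES[int(b * len(BATCH_SIZES))],
        "learning_rate": LR_MIN + lr * (LR_MAX - LR_MIN),  # linear interpolation
        "activation": ACTIVATIONS[int(act * len(ACTIVATIONS))],
        "optimizer": OPTIMIZERS[int(opt * len(OPTIMIZERS))],
    }
```

Each bat or particle position then corresponds to one trainable configuration whose MSE serves as its fitness.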

At the initialization step, the position of each bat is assigned randomly, and the velocity of each bat is initialized to 0. We then initialize the pulse rate and loudness after evaluating the fitness of the position of each bat. The performance of each set of hyperparameters proposed by the bat algorithm was then validated by the MSE of a fixed CNN architecture. For this purpose, we selected the top-performing architecture from an existing work [

S.No | Hyperparameter setting | MSE on M1 | MSE on M2
---|---|---|---
1 | ELU, Adagrad, batch size = 8, learning rate = 0.00052 | 0.0043 at epoch 47 | 0.0091 at epoch 35
2 | ELU, Adam, batch size = 32, learning rate = 0.0001 | 0.0060 at epoch 29 | 0.0125 at epoch 30
3 | ELU, Adam, batch size = 16, learning rate = 0.0003 | 0.0065 at epoch 24 | 0.0143 at epoch 13
4 | ReLU, Adagrad, batch size = 8, learning rate = 0.00032 | 0.0079 at epoch 45 | 0.0148 at epoch 48

With each hyperparameter setting, the model is trained for 50 epochs. To verify these hyperparameter settings, another top-performing model (denoted as M2) of the existing work [

For the next experiments of the CNN architecture optimization, we selected the second-best hyperparameter setting instead of the best setting, as it requires a relatively low number of epochs to provide satisfactory results.

In this subsection, the optimization of five architectural properties using the bat algorithm is explained, including the number of Conv layers, the number of filters in each Conv layer, the filter sizes in each layer, the number of FC layers, and the number of neurons in each FC layer; the ranges of these parameters in our approach are (3–13), (16–128), (3 × 3–7 × 7), (1–5), and (10–120), respectively. Algorithm  outlines the procedure, which uses the same bat-algorithm parameters (f_min, f_max, A_i, and r_i) as in the previous step.

When we automate the architecture formation of the CNN, we can encounter dimensionality problems. Thus, if the output feature map is not padded, then depending on the size of the Conv filters, the dimension is reduced. Alternatively, if the layers are padded so that the input and dimensionality of the output feature map are preserved, the CNN architecture will not fit in the memory for training as the layers increase. Therefore, before connecting the FC layers, it is important to reduce the dimensionality. We have solved these dimensionality problems using the strategy depicted in Algorithm
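The dimensionality bookkeeping above can be sketched with two small helpers (our own naming): an unpadded ("valid") k × k convolution shrinks each spatial dimension, a "same"-padded one preserves it, and 2 × 2 pooling halves it:

```python
def conv_out(size, kernel, padding="valid", stride=1):
    """Spatial output size of a Conv layer (square inputs/filters)."""
    if padding == "same":
        return -(-size // stride)           # ceil division; size preserved at stride 1
    return (size - kernel) // stride + 1    # 'valid': each dim shrinks by kernel - 1

def pool_out(size, pool=2):
    """2x2 max-pooling halves each dimension (floor division)."""
    return size // pool
```

Chaining these helpers over a candidate layer list makes it easy to detect when a proposed architecture would collapse the feature map before the FC layers.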

In this subsection, the optimization of five architectural properties using the PSO algorithm is explained, including the number of Conv layers, the number of filters in each Conv layer, the filter sizes in each layer, the number of FC layers, and the number of neurons in each FC layer. The position x_i of each particle encodes a candidate architecture, and its velocity v_i determines how that encoding is updated at each iteration.

For the steering angle predictive neural network to be implemented practically, it needs to be robust. One way of achieving robustness is to train the CNN model with an extensive dataset containing various scenarios. To achieve this, the CNN model with the least MSE obtained after the optimization was retrained with more data (i.e., 25,000 additional data samples). The top hyperparameter setting obtained in Section 6.1 was used for training the CNN model. Afterward, infield testing was performed using the trained model to drive the vehicle on a road. The camera was mounted on the front, aligned at the exact center of the vehicle.

The Python library OpenCV was used to capture live frames (of dimension 640 × 480), which were continuously fed to the trained CNN model for steering angle prediction. Because the training and testing frames must have the same dimensions, the top third of each frame from the live feed is cropped and the remainder rescaled to 300 × 300.
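The cropping step can be illustrated as follows; nested lists stand in for an OpenCV image, and the cv2.resize call to 300 × 300 is omitted for brevity:

```python
def crop_top_third(frame):
    """frame: H x W pixel grid (nested lists stand in for a cv2 image).
    Drops the top third of the rows, as done before rescaling."""
    h = len(frame)
    return frame[h // 3:]

# Dummy 480 x 640 frame where each pixel stores its row index:
frame = [[row] * 640 for row in range(480)]
cropped = crop_top_third(frame)   # rows [160, 480) remain
```

With a 480-row frame, 160 rows are discarded, so 320 rows remain before rescaling.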

The average prediction time taken by our model on live camera frames was 0.1 s. The model gave continuous values within the range of −2 to 2, with 2 as a command to steer toward the extreme left and −2 as a command to steer toward the extreme right. Values close to 0, e.g., 0.0052, −0.00093, and 0.0102, are interpreted as commands to drive straight. The predicted steering angle is sent to an Arduino ESP 32 DEVKIT V1, which controls the rotation of the wheels with the help of a rotational potentiometer. The experiments were conducted for a total of 1 h of driving at different times of the day and with different positions of the vehicle on the road. We found that the model produced the best predictions when the vehicle was in the middle of the road. However, it did not perform well near the boundaries of the road, particularly at the right margins. Such wrong predictions might arise because, during the collection of the dataset, the vehicle was mostly driven at the center of the road. As the width, lane marking color, and other properties of the testing track differ from those of the training track, the CNN model demonstrated a convincing ability to generalize to new scenarios. In
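The interpretation of the model output can be sketched as a simple mapping; the 0.05 dead-band threshold below is our own illustrative choice, not a value from the study:

```python
def steering_command(angle, straight_band=0.05):
    """Map a prediction in [-2, 2] to a coarse command: 2 steers to the
    extreme left, -2 to the extreme right, values near 0 mean straight.
    The dead-band width is an illustrative assumption."""
    if abs(angle) <= straight_band:
        return "straight"
    return "left" if angle > 0 else "right"
```

In practice the continuous value itself would be forwarded to the microcontroller; the discrete labels here only make the sign convention explicit.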

This section provides details of our methodology and findings. We optimized the CNN via a two-step process. In the first step, we tuned four hyperparameters, and, in the second step, we optimized five architectural properties of the CNN. We began our experiment by tuning four hyperparameters for which a fixed CNN model was employed. Thereafter, the effectiveness of the tuned hyperparameter settings for other CNN architectures was verified by training another model (M2).

In the solution vector of the best bat in step 1 of our approach, we obtained a 0.00052 learning rate for the Adagrad optimizer. This is then used as the initial learning rate by the optimizer and is updated for every parameter based on the past gradients. We observed that the Adagrad optimizer consumed more training time and required a higher number of epochs as compared to those of Adam, which can be observed in

S.No | Models
---|---
B1 | Conv124--5x5--Conv119--5x5--Pooling--2x2--Conv116--3x3--Conv104--3x3--Pooling--2x2--Conv98--7x7--Conv82--7x7--Pooling--2x2--Conv75--3x3--Conv57--3x3--Pooling--2x2--Conv40--7x7--Conv29--7x7--Pooling--2x2--Conv19--5x5--Conv16--5x5--Dropout--Flatten--FC--104--FC--102--FC--75--FC--41--FC--1
B2 | Conv75--3x3--Conv71--7x7--Pooling--2x2--Conv54--3x3--Conv50--3x3--Pooling--2x2--Conv44--5x5--Conv34--5x5--Pooling--2x2--Conv21--5x5--Dropout--Flatten--FC--88--FC--63--FC--39--FC--19--FC--1
B3 | Conv121--5x5--Conv114--3x3--Pooling--2x2--Conv107--7x7--Conv95--7x7--Pooling--2x2--Conv83--3x3--Conv82--3x3--Pooling--2x2--Conv58--5x5--Conv27--5x5--Pooling--2x2--Conv23--5x5--Conv22--7x7--Pooling--2x2--Dropout--Flatten--FC--113--FC--41--FC--1
B4 | Conv114--7x7--Conv113--7x7--Pooling--2x2--Conv98--5x5--Conv90--5x5--Pooling--2x2--Conv67--5x5--Conv58--5x5--Pooling--2x2--Conv51--5x5--Conv38--5x5--Pooling--2x2--Conv38--5x5--Conv38--5x5--Pooling--2x2--Conv38--5x5--Dropout--Flatten--FC--120--FC--120--FC--82--FC--50--FC--23--FC--1
B5 | Conv128--7x7--Conv115--7x7--Pooling--2x2--Conv76--7x7--Conv61--7x7--Pooling--2x2--Conv56--7x7--Conv49--7x7--Pooling--2x2--Conv42--7x7--Conv37--7x7--Pooling--2x2--Conv36--7x7--Conv34--7x7--Pooling--2x2--Conv28--7x7--Conv17--7x7--Pooling--2x2--Conv17--7x7--Dropout--Flatten--FC--110--FC--101--FC--11--FC--1
B6 | Conv113--5x5--Conv108--5x5--Pooling--2x2--Conv97--5x5--Conv85--5x5--Pooling--2x2--Conv74--5x5--Conv65--5x5--Pooling--2x2--Pooling--2x2--Conv53--5x5--Conv52--5x5--Pooling--2x2--Conv38--5x5--Conv34--5x5--Pooling--2x2--Dropout--Flatten--FC--119--FC--112--FC--85--FC--55--FC--19--FC--1
B7 | Conv111--5x5--Conv110--3x3--Pooling--2x2--Conv108--5x5--Conv95--5x5--Pooling--2x2--Conv86--3x3--Conv74--7x7--Pooling--2x2--Conv72--3x3--Conv55--3x3--Pooling--2x2--Conv48--7x7--Conv25--3x3--Pooling--2x2--Conv22--7x7--Dropout--Flatten--FC--70--FC--49--FC--47--FC--41--FC--1

Models | MSE
---|---
B1 | 0.0012
B2 | 0.0015
M1 [ | 0.0033
M3 [ | 0.0030

It is useful to observe the cumulative effect of the number of layers of the CNN model on its performance. Hence, we have plotted the MSE of the models obtained through both optimization algorithms against the total number of layers (see

In the above figure, the number of layers refers to the sum of Conv and FC layers only. We have observed that the deep models perform better than the shallow ones.

After the process of training, each CNN model in our approach takes 300 × 300 RGB images as input and produces a real number in the range of −2 to 2 as output. This number is the steering angle predicted by the CNN for a given image. The top CNN models obtained in our approach, with the least MSEs, are depicted in

In the above table, the number of Conv filters used in each Conv layer is given, followed by the filter size. The filter size of each max-pooling layer is likewise given after the word "Pooling". Each architecture has a flatten layer between the Conv and FC layers, and the number of neurons in each FC layer is given right next to it. Moreover, each model ends with an FC layer having one neuron, which serves as the output layer. A dropout layer is added before the flatten layer in each model. We used the ELU activation function for every layer except the last FC layer, where a linear activation function was used. We have named the models for further reference in
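Given this notation, the layer counts of any model in the table can be recovered mechanically; a small illustrative parser (our own helper, not part of the study's code):

```python
import re

def count_layers(spec):
    """Count Conv and FC layers in an architecture string such as
    'Conv75--3x3--...--FC--88--...--FC--1' (notation from the table)."""
    conv = len(re.findall(r"Conv\d+", spec))
    fc = len(re.findall(r"FC--\d+", spec))
    return conv, fc

conv, fc = count_layers(
    "Conv75--3x3--Pooling--2x2--Conv44--5x5--Flatten--FC--88--FC--1"
)
```

Summing the two counts gives the "number of layers" used in the layer-count-versus-MSE plot (Conv and FC layers only).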

As the pooling layer with filter size 2 × 2 reduces the size of each feature map by a factor of 2, i.e., each dimension is halved, we cannot apply a pooling layer after each Conv layer. As the input image size used in our approach is 300 × 300, after seven applications of the pooling layer, each feature-map dimension is reduced to 2 × 2, whereas the maximum number of Conv layers used in our approach is 13. Therefore, we adopted the pooling-placement strategy depicted in Algorithm , which checks the current feature-map size before each pooling layer is applied.
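The arithmetic behind this constraint is easy to check; a small helper (ours) that applies repeated 2 × 2 pooling:

```python
def size_after_poolings(size, n, pool=2):
    """Feature-map side length after n applications of 2x2 pooling
    (each application halves the dimension, flooring)."""
    for _ in range(n):
        size //= pool
    return size

size_after_poolings(300, 7)   # 300 -> 150 -> 75 -> 37 -> 18 -> 9 -> 4 -> 2
```

With a 300 × 300 input, a seventh pooling layer already shrinks each dimension to 2, so pooling cannot follow every one of up to 13 Conv layers.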

Although we can see in

Through careful observation of our findings, we can conclude that CNN models having a high number of layers perform better than the models with a low number of layers. This may not necessarily be true for Conv layers more than 13 and FC layers more than 5, because this is the maximum range we selected for our methodology. The evaluation criterion, i.e., MSE of the best CNN architecture obtained at the end of all iterations, articulates the effectiveness of our approach in optimizing the CNN architecture and other hyperparameters using the metaheuristic algorithms. We found, during our experimentation, that even though the deep models require more training time and epochs, they provide better results as compared to the shallow ones.

Deep learning models, as opposed to traditional algorithms articulated with preformulated rules, are efficient in representing the relationship between a response and its predictors. However, the problem space comprising all possible combinations of real values for each parameter of these models is vast; exhaustive search becomes infeasible as the number of layers increases. Therefore, we applied metaheuristic algorithms to tune the CNN architectural parameters and other hyperparameters, and the results above confirm the effectiveness of this approach. In the future, we plan to collect our own dataset and compare the performance of the CNN model trained on the Udacity dataset with that of the same model trained on our own dataset, through infield testing on the same vehicle in different scenarios.