Performance Comparison of Deep CNN Models for Detecting Driver’s Distraction

According to various worldwide statistics, most car accidents occur solely due to human error. The person driving a car needs to be alert, especially when travelling on high-traffic roads that permit high-speed transit, since a slight distraction can cause a fatal accident. Even though semiautomated checks, such as speed-detecting cameras and speed barriers, are deployed, controlling human errors is an arduous task. The key causes of driver’s distraction include drunken driving, conversing with co-passengers, fatigue, and operating gadgets while driving. If these distractions are accurately predicted, the drivers can be alerted through an alarm system. This research develops deep convolutional neural network (deep CNN) models for predicting the reason behind the driver’s distraction. The deep CNN models are trained using numerous images of distracted drivers. The performance of the deep CNN models, namely the VGG16, ResNet, and Xception networks, is assessed based on evaluation metrics such as the precision score, the recall/sensitivity score, the F1 score, and the specificity score. The ResNet model outperformed the other models and was the best detection model for predicting and accurately determining the drivers’ activities.

road safety. Predicting the reasons for the driver's distraction and possibly alerting the driver could avoid such accidents. This work devises the tools and methods to determine the best and most efficient Deep Convolutional Neural Network (deep CNN) model for detecting the reason behind a driver's distraction. Deep CNNs have proven to perform exceptionally well in classifying images; thus, they seem to be an excellent fit for resolving this problem.
A deep CNN usually requires significantly less preprocessing than the other classification algorithms [5][6][7][8][9][10]. The entire process of finding the best deep CNN model begins with comparing the models in terms of different evaluation metrics and selecting the best among them. The deep CNN models help to classify the distracted driver dataset. Further, this system would ensure road safety in high-risk roads and highways, where speed is also a concern, and the fatality rate is much higher. Even though the external checks are essential for curbing accidents, predicting the driver's distraction plays a significant role in saving lives and guaranteeing road safety.
This research determines an optimized approach among different deep CNN models for detecting the driver's distraction. The various models' performances were compared using the evaluation metrics, and then the best-suited approach was determined based on these metrics. The materials and methods section deals with the background concepts and related works on this topic, and it briefly introduces the deep CNN models. The implementation section discusses the hardware and software requirements, the dataset utilized, and the individual deep CNN models' parameter settings. Next, the results and discussions section provides the performance comparisons of various deep CNN models. Finally, the conclusion section summarizes this work along with a brief discussion about possible future enhancements.

The Deep Convolutional Neural Network (Deep CNN)
The concept of image recognition, classification and processing has evolved through various architectures and algorithms, and deep CNN models are a branch of Deep Learning [11]. Firstly, the images get converted into the two-dimensional matrix [12][13][14][15]. However, this reduces the quality of the image when it has pixel dependencies. The deep CNN algorithm ensures that the image quality and its spatial and temporal dependencies are also preserved. A deep CNN model trained on a larger dataset usually generalizes much better than a model trained with a smaller dataset. Further, the deep CNN model processes the images with minimum computation and minimal damage to the pixel values. The entire process of the deep CNN image classification can be broadly divided into three steps. The image passes through the convolutional layers, the pooling layers, and the Fully Connected Layers [16]. Finally, a probabilistic function is applied to classify the images. Various deep CNN architectures such as LeNet, AlexNet, VGGNet, ResNet, and Xception can be deployed for image classification. This work focuses on three prominent deep CNN architectures: the ResNet, Xception, and the VGG16 model.

The ResNet Model
Generally, in deep CNN models, the classification efficiency is expected to improve with the number of network layers. However, beyond a certain depth, adding layers causes an increase in the training and testing error rates. This phenomenon is associated with the vanishing or exploding gradient. This issue can be resolved using the Residual Network (ResNet) [17][18][19]. These networks deploy an approach known as skip connections: the network bypasses a few layers and connects their input directly to their output. ResNet's basic architecture is inspired by the VGG network, where the convolutional layers use 3 × 3 filters. The architecture involves two concepts for model optimization. The layers possess the same number of filters for the same output feature map size. Moreover, when the output feature map's size is halved, the number of filters is doubled to preserve each layer's time complexity [20][21][22]. In this work, the ResNet model was trained and tested over the Kaggle dataset for Distracted Driver Detection by State Farm, and it classified the driver's distraction efficiently. Fig. 1 portrays the architecture of the ResNet model that consists of 152 layers. A ResNet proceeds through four segments that share a similar behavioural pattern. Each segment applies 3 × 3 convolutions with a constant number of filters: 64, 128, 256, and 512, respectively. The input is bypassed after every two convolutions. Moreover, the width and height dimensions remain constant within a segment. Skip connections perform identity mapping, and their outputs are added to the outputs of the stacked layers. Furthermore, the ResNet model is less complicated and can be easily optimized compared to the other networks. Besides, this model converges faster and generates better results than other peer-level networks.
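The identity skip connection described above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions and not the ResNet implementation used in this work: dense layers stand in for the 3 × 3 convolutions, and the batch size, feature width, and weights are hypothetical.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: keeps non-negative values, zeroes the rest."""
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two stacked layers F(x) plus an identity skip connection.

    Dense (matrix-multiply) layers are an assumption made for brevity;
    a real ResNet block uses 3 x 3 convolutions instead.
    """
    hidden = relu(x @ w1)       # first stacked layer
    residual = hidden @ w2      # second stacked layer, giving F(x)
    return relu(residual + x)   # the unchanged input is added back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # hypothetical batch: 4 samples, 8 features
w_zero = np.zeros((8, 8))       # hypothetical all-zero weights

out = residual_block(x, w_zero, w_zero)
# With all-zero weights F(x) = 0, so the block passes relu(x) through.
print(np.allclose(out, relu(x)))
```

Because the input is added back unchanged, a block whose stacked layers contribute nothing still passes its input forward, which is one common explanation for why very deep residual networks remain trainable.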

The Xception Model
The Extreme Inception, or Xception, model is derived from the Inception CNN model and is regarded as an 'extreme' improvement on it. The Inception model has deep convolutional layers and wider convolutional layers that work in a parallel manner. This model has two different levels, each with three convolutional layers. Unlike the Inception model, the Xception model has two levels, where one of them has a single layer. This layer slices the output into three segments and passes it on to the next set of filters. The first level has a single convolutional layer with a 1 × 1 filter, while the next level has three convolutional layers with a 3 × 3 filter. The aspect that defines the Xception model is the Depthwise Separable Convolution [23][24][25]. A general deep CNN model handles the spatial and channel dimensions jointly, whereas the Xception model separates them into a depthwise convolution and a pointwise convolution. The work by Chollet [26] shows the improvement of Xception over the previous models. This research uses the Xception model to evaluate the distracted driver dataset for classifying the driver's distraction. The architecture of the Xception network model is illustrated in Fig. 2. The Xception model is a 71-layer deep CNN, inspired by the Inception model from Google, and it is based on an extreme interpretation of the Inception model [27]. Its architecture is stacked with depthwise separable convolutional layers. The pre-trained version of the model is trained using millions of images from the ImageNet database. Moreover, this model can classify hundreds of object categories and has learned rich feature representations for a wide range of pictures. The Xception model is widely used in the domains of image identification and classification.
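To make the difference between a standard convolution and a depthwise separable convolution concrete, the following sketch compares their weight counts. The kernel size and channel counts are hypothetical values chosen only for illustration, and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: every output channel has its
    own kernel x kernel filter over every input channel."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable)  # 294912 33920
```

For this hypothetical layer, the separable form needs roughly one ninth of the weights, which suggests why stacking such layers keeps a deep model comparatively compact.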

The VGG16 Model
The VGG16 architecture is an improved version of the AlexNet deep CNN model. When this model was tested over the ImageNet dataset, it showed a top-5 test accuracy of 92.7%. The VGG16 model uses 16 layers with tunable parameters: 13 convolutional layers and three fully connected layers. It also contains five max-pooling layers in the middle, and at the output, it has the Softmax activation function [28][29][30]. The architecture is divided into sets of convolutional layers and max-pooling layers, followed by the fully connected layers and the activation function. In the VGG16 model, the image passes through two sets of two convolutional layers and one max-pooling layer. Subsequently, it passes through three sets of three convolutional layers and one max-pooling layer. After this stage, the image passes through the three dense, fully connected layers, finally entering the Softmax activation function [31].
The VGG16 model also has hidden layers with the Rectified Linear Unit (ReLU) as the activation function. This model happens to be less computationally intensive than the previous ones due to the decrease in kernels. Besides that, the convolutional layer preserves the image resolution as it has a small receptive field of 3 × 3 and a stride of 1. Fig. 3 represents the architecture of the VGG16 model. The input to the first convolutional layer is an RGB image of a fixed size. The picture moves across many network layers, which use filters with a minimal 3 × 3-pixel receptive field. The convolution stride is fixed at one pixel, and the spatial resolution is preserved after the convolution [32].
For the 3 × 3 convolutional layers, one layer of zeros is added to the borders to obtain the same padding. The max-pooling function is performed across a 2 × 2-pixel window, with a stride of 2. Three fully connected layers follow the stack of convolutional layers, with the final layer being the Softmax layer. The fully connected layer configuration is similar in every network, and every hidden layer is provided with the ReLU activation function.
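The layer arithmetic described above can be checked with a short sketch. It applies the standard output-size formulas for a 3 × 3 convolution with one pixel of zero padding and stride 1, and for 2 × 2 max pooling with stride 2. The 224-pixel input side is an assumption (the conventional VGG16 input size) and is not stated in this work.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Side length after a convolution with zero padding on each border."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window=2, stride=2):
    """Side length after max pooling."""
    return (size - window) // stride + 1

# Convolutional layers per set, as described in the text:
# two sets of two layers, then three sets of three layers.
conv_layers_per_set = [2, 2, 3, 3, 3]

size = 224  # assumed input side length, not taken from this work
sizes_after_pooling = []
for n_convs in conv_layers_per_set:
    for _ in range(n_convs):
        size = conv_output_size(size)   # same padding keeps the size unchanged
    size = pool_output_size(size)       # each pooling layer halves the size
    sizes_after_pooling.append(size)

total_layers = sum(conv_layers_per_set) + 3  # plus three fully connected layers
print(sizes_after_pooling)  # [112, 56, 28, 14, 7]
print(total_layers)         # 16
```

The sketch confirms that only the five pooling layers reduce the spatial resolution, and that 13 convolutional layers plus three fully connected layers give the 16 tunable layers in the model's name.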

Model Comparison
This research presents an accurately trained model for classifying the driver's distraction. The rate of fatal accidents due to the driver's error or negligence has been at a record high for the past few years. Accidents can be prevented by alerting drivers whenever they tend to get distracted. The input provided for training the system consists of images of distracted drivers, such as a driver using a mobile phone, adjusting radio channels, drinking, or engaged in other such activities [33]. This dataset is used to train the various deep CNN algorithms, and the best model for this task is then determined. For increasing distraction levels, the model proportionately recognizes a wide range of distracted drivers better while eliminating the non-distracted ones. The deep CNN algorithms require minimal preprocessing of the data; also, they can capture the spatial and temporal dependencies in images. However, basic preprocessing methods are still needed to ensure that the dataset does not provide irrelevant details. The RGB images are converted into the grey-scale format, where a two-dimensional matrix represents each image. Thresholding the images is necessary because of the background noise from the car seats. Thresholding ensures that only the relevant part(s) of the image characterizing the driver's distraction are extracted. These basic image processing methods ensure the obtained images' appropriateness and contribute to the dataset's variety. Fig. 4 shows this work's methodological flow. As mentioned earlier, the deep CNN family provides various image classification architectures and models. We used three models: ResNet, Xception, and VGG16. These models were trained separately using the distracted driver dataset. Further, various evaluation metrics were employed to assess these models' performance, and the best model was decided based on these metrics. To this end, the ResNet was observed to be the best model for performing a successful driver's distraction classification.

Hardware Requirement
The system was executed on a Hewlett-Packard (HP) Spectre x360 convertible workstation with a 64-bit Intel® Core™ i7 processor and a GPU. It had 16 GB RAM and a 64-bit operating system with touch and pen input support. The camera used in this system was an HP TrueVision Full HD WVA Webcam that is built into the workstation and integrated with dual digital microphones.

Software Requirement
The software applications used for this system included a Python platform and R-Studio. The system was built primarily in the Python language, with secondary support from R programming. Several Python libraries, such as NumPy, Pandas, and Matplotlib, were used to implement the deep CNN models. Further, these models were executed using the open-source machine learning and deep learning libraries Keras and TensorFlow.

Dataset Description
The State Farm Distracted Driver Detection dataset used in this work was obtained from Kaggle. This dataset comprises more than 20,000 images, totalling an overall size of approximately 8 GB. All the dataset images had the same dimension, 480 × 480 pixels, and showed drivers in various driving postures. The pictures were classified into ten classes, as shown in Tab. 1. The different deep CNN models were trained to predict the likelihood of the driver's distraction in each picture. Fig. 5 shows sample pictures from each of the ten classes of images. The histogram visualized in Fig. 6 shows how the more than 20,000 images are distributed across the ten classes, with approximately 2500 images present under each class. However, one exception is the number of images in class C8, which consists of people talking to a passenger. This category has 4000 images, compared with the other classes, whose average frequency is around 2350.

Data Preprocessing
Certain observations were drawn after acquiring and evaluating the information about the dataset. Not all the pixel values contributed equally to the class value assigned to a particular image. For example, in most cases, the positioning of the hands and head plays a vital role in determining the image class. The images were preprocessed to remove the background noise, which barely contributed as a prominent feature for the evaluation. The image data was converted into 64 × 64 pixels from its original resolution of 480 × 480 pixels. The images contained considerable background noise not required for the prediction, such as the windshield and the seats. The essential characteristics of the image are the positioning of the hands, head, and legs. Hence, unwanted information was removed using image processing techniques like grey-scaling and thresholding. The mean RGB values over every image in the dataset were determined, and these values were 95.124, 96.961, and 80.123. The mean value was subtracted from every image's pixel values to retain only valuable information for the training model. The position of the arms, head, legs, and any new object was still clearly identifiable, making the image appropriate for further processing by the deep CNN models.
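The preprocessing steps above can be sketched in NumPy as follows. This is an illustrative sketch and not the pipeline used in this work: the nearest-neighbour resize, the plain channel average used for grey-scaling, the threshold cutoff of 128, and the random stand-in image are all assumptions. Only the mean RGB values and the image sizes are taken from the text.

```python
import numpy as np

# Mean RGB values reported in the text.
MEAN_RGB = np.array([95.124, 96.961, 80.123])

def resize_nearest(image, out_h=64, out_w=64):
    """Downsample by nearest-neighbour sampling (assumed method)."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows][:, cols]

def subtract_mean(image):
    """Centre each colour channel on the dataset mean."""
    return image.astype(np.float64) - MEAN_RGB

def to_grey(image):
    """Grey-scale by averaging the three channels (assumed weighting)."""
    return image.astype(np.float64).mean(axis=2)

def threshold(grey, cutoff=128.0):
    """Binary mask: 1 where the pixel is at least the hypothetical cutoff."""
    return (grey >= cutoff).astype(np.uint8)

# Random stand-in for one 480 x 480 RGB dataset image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(480, 480, 3), dtype=np.uint8)

small = resize_nearest(image)          # 64 x 64 x 3
centred = subtract_mean(small)         # mean-subtracted RGB
mask = threshold(to_grey(small))       # 64 x 64 binary mask
print(small.shape, centred.shape, mask.shape)
```

In practice, an interpolating resize and a data-driven threshold would likely be preferable; the simple forms here only show the order and effect of the operations.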

Execution of ResNet Model
The ResNet model used fivefold cross-validation to verify the results' stability and authenticity. A checkpoint was created after each set of validations to avoid the loss of the stored weights. Further, each cross-validation was set to run with ten epochs, and the various performance evaluation metrics were determined. As shown in Fig. 7, the model was built on the ResNet50 layer, initialized with the 'Imagenet' weights available in the Keras library. Next, these values were flattened using a flatten layer. The ResNet deep CNN model was fine-tuned with the dense layer using the 'Softmax' function. Further, to utilize an adaptive learning rate, Adam optimization was used instead of Gradient Descent optimization.
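The fivefold cross-validation mentioned above can be sketched as a simple index split. The function name, the sample count, and the random seed below are hypothetical, and the model training and checkpointing steps are only indicated by a comment.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Each sample lands in the validation set of exactly one fold.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # shuffle once up front
    folds = np.array_split(order, n_folds)    # near-equal partitions
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i]
        )
        yield train_idx, val_idx

# Hypothetical dataset of 100 samples.
splits = list(kfold_indices(100))
for fold, (train_idx, val_idx) in enumerate(splits):
    # A model would be trained on train_idx for ten epochs here,
    # evaluated on val_idx, and its weights saved to a checkpoint.
    print(fold, len(train_idx), len(val_idx))
```

Averaging the metrics over the five validation folds gives a more stable estimate than a single train/validation split.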

Execution of Xception Model
The Xception model was set up using transfer learning, utilizing a pre-trained Xception model. Like the ResNet model, in the Xception model, each cross-validation was run with ten epochs, and the various evaluation metrics were determined. As shown in Fig. 8, the Xception model was prepared using the Xception layer with the weights trained using the 'Imagenet' dataset. The shuffle parameter was set to true, and the verbose parameter was set to 1. Further, these values were flattened using a flatten layer. The Xception model, like the ResNet model, was fine-tuned with the dense layer using the 'Softmax' function. Adam optimization was used instead of Gradient Descent optimization, and the loss parameter was set to 'Categorical Crossentropy.'
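For reference, the categorical cross-entropy loss named above can be written out directly. This NumPy sketch follows the standard definition, the negative log of the probability assigned to the true class averaged over the batch. The clipping constant and the example values are assumptions for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean categorical cross-entropy over a batch.

    y_true: one-hot labels, shape (batch, classes)
    y_pred: predicted probabilities, shape (batch, classes)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Hypothetical single sample whose true class is C3 out of ten classes.
y_true = np.eye(10)[[3]]
uniform = np.full((1, 10), 0.1)   # an uninformed, uniform prediction
loss = categorical_crossentropy(y_true, uniform)
print(round(loss, 4))  # 2.3026, i.e. ln(10)
```

A uniform prediction over ten classes yields a loss of ln(10), which is a useful baseline: a trained model should fall well below this value.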

Execution of VGG16 Model
The VGG16 model was set up with the Softmax function and the ReLU activation function. The ReLU activation function helped filter out the negative values and pass only the non-negative values onto the next layer. The fully connected layers were initially added to the network with appropriate activation functions. Two dense layers with 1024 and 512 units, respectively, were used first, utilizing the ReLU activation function. After the two dense ReLU layers, a dense Softmax layer with ten units was added to the network. Ten units were used to predict the occurrences of the ten distraction classes. The Softmax layer finally returned a value in the range of 0 to 1 for each of the distracted drivers' image classes (C0 to C9). Further, while training the model, Adam optimization was used, rather than the Stochastic Gradient Descent (SGD), to approach the global minimum. The learning rate was set as 1e−5. This learning rate was tweaked several times to reach the current results. The description of the VGG16 model network is shown in Fig. 9. The input data was passed through these different layers. The fully connected dense layers were included in the model, and finally, a ten-unit output was used to classify the images under the ten distraction classes.
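The behaviour of the ten-unit Softmax output can be illustrated with a short NumPy sketch. The logit values are hypothetical; the class labels C0 to C9 follow the text.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores from the ten-unit output layer for one image.
logits = np.array([[0.2, 1.5, -0.3, 0.0, 2.7, 0.1, -1.2, 0.4, 0.9, -0.5]])
probs = softmax(logits)

# The predicted class is the one with the highest probability.
predicted = f"C{int(probs.argmax(axis=1)[0])}"
print(predicted)  # C4
```

Each of the ten outputs lies between 0 and 1, and together they sum to 1, so they can be read as the model's confidence in each distraction class.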

Results and Discussions
The performance comparison was based on four evaluation metrics: the precision score, the recall/sensitivity score, the F1 score, and the specificity score [34]. True positive, true negative, false positive, and false negative values were used to compute the evaluation metrics [35][36][37][38][39][40][41]. The results were plotted using Python's Matplotlib library for better interpretation and visualization.
The results tabulated in Tab. 2 represent the evaluation metric scores for the ten classes of images obtained by the deep CNN ResNet model. The highest precision, recall/sensitivity, and F1 score were observed for the class label C7, and the lowest precision, recall/sensitivity, and F1 score were seen in class label C6. However, the specificity score was highest for the class label C9 and lowest for the class label C2. The visualization in Fig. 10 shows the precision, recall/sensitivity, F1 score, and specificity score for the ResNet model. Overall, this model performed well for all the class labels, especially the C7-C9 class labels.
The evaluation metric scores obtained by the Xception model are tabulated in Tab. 3. The visualization of the evaluation metric scores for the Xception model is shown in Fig. 11, where the precision, recall/sensitivity, F1 score, and specificity score are plotted. It can be observed that these scores are lower than those of the ResNet model. The highest precision, recall/sensitivity, and F1 score were observed for the class label C6, while the lowest precision and F1 score were seen in the case of the class label C1, and the lowest recall/sensitivity was observed for the class label C4. The specificity score was highest for the class label C0 and lowest for C5.
The evaluation metric scores obtained by the VGG16 model are tabulated in Tab. 4. It can be observed that these scores are lower than those of the ResNet and the Xception models. The graphical visualization of the evaluation metric scores for the VGG16 model is shown in Fig. 12. The highest precision score was observed for the class label C6, and the lowest was observed for C2. Similarly, the highest recall/sensitivity score was seen in C2, and the lowest was observed for C7. The F1 score was maximum for C6 and minimum for C7. Also, the specificity score was maximum for C8 and minimum for C5.
Comparing the evaluation metric scores shows that the ResNet model provides the best performance, followed by the Xception model. Even though the VGG16 model yielded lower evaluation metric scores than the other two models, the results were satisfactory [42][43][44][45]. These models can be further optimized to prevent the overfitting issue in the network. Fine-tuning the learning rates or the hyper-parameters and/or adding or removing layers can also optimize the model. The activation functions, such as the ReLU, Sigmoid, and Softmax functions, could also be used more efficiently to achieve better results.
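The four evaluation metrics can be derived per class from a confusion matrix using the true/false positive and negative counts mentioned above. The sketch below uses the standard definitions; the three-class confusion matrix is a hypothetical example and does not reproduce any result reported in this work.

```python
import numpy as np

def per_class_metrics(confusion):
    """Per-class precision, recall, F1, and specificity.

    confusion[i, j] counts samples of true class i predicted as class j.
    Assumes every class occurs and is predicted at least once, so no
    denominator is zero.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp   # predicted as the class, actually another
    fn = confusion.sum(axis=1) - tp   # belongs to the class, predicted as another
    tn = confusion.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return precision, recall, f1, specificity

# Hypothetical three-class confusion matrix (rows: true, columns: predicted).
confusion = [[8, 1, 1],
             [2, 7, 1],
             [0, 2, 8]]
precision, recall, f1, specificity = per_class_metrics(confusion)
print(precision)    # [0.8 0.7 0.8]
print(specificity)  # [0.9  0.85 0.9 ]
```

In a multi-class setting, specificity is usually high for every class because the true negatives pool all other classes, so precision, recall, and F1 tend to separate the models more clearly.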

Conclusion
After implementing all the deep CNN models (ResNet, Xception, and VGG16), it can be concluded that the ResNet model provides the best performance, followed by Xception and VGG16, respectively. The evaluation metrics used for comparing the models' performances were the precision score, the recall/sensitivity score, the F1 score, and the specificity score. The dataset consisted of distracted driver images, and this work classified them into ten classes based on the distractions. Even though the VGG16 model is primitive compared to the other two models, it offers satisfactory results. However, as the complexity of the images and the dataset increases, the differences tend to become more prominent, and the superior performance of the ResNet model becomes evident. The advantage of using the ResNet deep CNN architecture for the distracted driver dataset is that the layers are stacked better while having far fewer kernels than in the VGG16 model. The ResNet model is less complicated and can be easily optimized compared to the other networks. Also, this model converges faster and generates better results than other networks. Furthermore, a system that uses the ResNet deep CNN architecture for detecting the driver's distraction can serve as the basis for various alerting prototypes in the future by integrating cloud technology, the Internet of Things, and other disciplines. Moreover, alarm systems can be installed to detect the driver's distraction and ensure road safety. In conclusion, these systems help reduce accidents and promote self-awareness in drivers by continuously alerting them.

Funding Statement:
The authors received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.