Traffic accidents are often caused by driver fatigue or distraction. To prevent such accidents, several low-cost hypovigilance (Hypo-V) systems have been developed in the past based on multimodal, hybrid (physiological and behavioral) feature sets. In the same spirit, this paper proposes a real-time driver inattention and fatigue detection system (Hypo-Driver) that extracts hybrid features through multi-view cameras and biosignal sensors. The features are derived from non-intrusive sensors and capture changes in driving behavior and visual facial expressions. To obtain robust visual facial features in an uncontrolled environment, three cameras are deployed at multiple viewpoints (0°, 45°, and 90°) around the driver. To develop the Hypo-Driver system, physiological signals (electroencephalography (EEG), electrocardiography (ECG), surface electromyography (sEMG), and electrooculography (EOG)) and behavioral information (PERCLOS70-80-90%, mouth aspect ratio (MAR), eye aspect ratio (EAR), blinking frequency (BF), and head-tilt ratio (HT-R)) are collected and pre-processed, followed by feature selection and fusion. Driver behavior is classified into five stages: normal, fatigue, visual inattention, cognitive inattention, and drowsy. The Hypo-Driver system trains behavioral features with a convolutional neural network (CNN), while a recurrent neural network with long short-term memory (RNN-LSTM) extracts physiological features. After fusing these features, the Hypo-Driver system classifies Hypo-V into five stages using trained layers and a dropout layer in a deep residual neural network (DRNN) model. To test the performance of the Hypo-Driver system, data from 20 drivers are acquired, and the results of Hypo-Driver are compared to state-of-the-art methods. On average, the Hypo-Driver system achieves a detection accuracy (ACC) of 96.5%.
The obtained results indicate that the Hypo-Driver system, based on multimodal and multiview features, outperforms other state-of-the-art driver Hypo-V systems while handling many anomalies, such as varying head pose and partial occlusion of the face.
Internet of things (IoT) [
Driver Hypo-vigilance (Hypo-V) [
The Hypo-Driver system is developed by training visual and non-visual features from 2-D images and 1-D sensor signals, respectively. To build it, new pre-trained convolutional layers and dropout layers are added to recognize driver distraction and fatigue in five stages. The paper's primary contribution is a system that detects the real-time hypovigilance level of drivers using multimodal features recognized by a deep learning architecture. A real-time driver distraction and fatigue detection system (Hypo-Driver) is developed through multi-view cameras and biosignal sensors to classify five states of hypovigilance. To develop the Hypo-Driver system, physiological signals (electroencephalography (EEG), electrocardiography (ECG), surface electromyography (sEMG), and electrooculography (EOG)) and behavioral information (PERCLOS70-80-90%, mouth aspect ratio (MAR), eye aspect ratio (EAR), blinking frequency (BF), and head-tilt ratio (HT-R)) are collected and pre-processed, followed by feature selection/reduction and fusion. After fusing these features, the Hypo-Driver system classifies Hypo-V into five stages using trained layers and a dropout layer in the deep residual neural network (DRNN) model. These steps are explained in Section 3.
Multimodal features used to define the various levels of hypovigilance (Hypo-V) have received considerable interest in recent research due to their capacity to leverage deep models for recognizing different driver activities. Many authors currently use a variety of data sources [
Authors in reference [
In reference [
Several other multimodal approaches have been investigated in the past. In reference [
We attempt to solve two distinct issues in this study: first, detecting driver distraction, and second, estimating the level of distraction. In this paper, we evaluate multimodality using a late-fusion approach. The pre-processing and feature extraction steps for each modeling approach are discussed in the following paragraphs. To develop the Hypo-Driver system, behavioral and physiological features are extracted from each driver to detect the multistage Hypo-V state. For behavioral features, three cameras are deployed at different angles (0°, 45°, and 90°) around the driver to obtain multiview features. The physiological signals (EEG, ECG, sEMG, and EOG) and behavioral information (PERCLOS70-80-90%, mouth aspect ratio (MAR), eye aspect ratio (EAR), blinking frequency (BF), head-tilt ratio (HT-R)) are collected and pre-processed, followed by feature selection/reduction and fusion. Driver behavior is classified into five stages, namely, normal, fatigue, visual inattention, cognitive inattention, and drowsy. The Hypo-Driver system consists of four major modules: a vision module, a sensors module, a fusion module, and a prediction module, which are explained in the upcoming subsections. A systematic flow diagram is displayed in
In this paper, we use statistical features extracted from four types of physiological signals (EEG, ECG, sEMG, and EOG). In total, we use 80 features extracted from these physiological sensors: 20 from EEG, 10 from ECG, 20 from sEMG, and 30 from EOG signals. These features are concatenated into a single feature vector. Compared to the other sensors, the EEG signals have special properties for defining different levels of hypovigilance. For example, the TGAM sensor kit measures brain signals directly, which makes it easy to evaluate the distraction level of each driver. Brain signals carry information about brain activity, and the alpha, delta, and theta bands are the most commonly used to evaluate tiredness and distraction. When the driver's level of awareness drops, the three bands change to various degrees: the delta and theta signals increase rapidly, while the alpha signal increases somewhat, but not as much as the first two. In addition, we have collected attention levels in the range of 60 to 75, meditation levels in the range of 50 to 60, and blink strength in the range of 40 to 55.
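As a minimal sketch of this concatenation step (the per-sensor feature values below are random placeholders standing in for the real sensor statistics, with the counts stated above):

```python
import numpy as np

# Hypothetical per-sensor statistical feature vectors with the stated
# counts: 20 (EEG), 10 (ECG), 20 (sEMG), 30 (EOG).
rng = np.random.default_rng(0)
eeg_feats = rng.normal(size=20)
ecg_feats = rng.normal(size=10)
semg_feats = rng.normal(size=20)
eog_feats = rng.normal(size=30)

# Concatenate into the single 80-dimensional physiological feature vector.
physio_vector = np.concatenate([eeg_feats, ecg_feats, semg_feats, eog_feats])
assert physio_vector.shape == (80,)
```

The fixed ordering (EEG, ECG, sEMG, EOG) matters: downstream selection and fusion stages index into this vector by position.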
Physiological signals (EEG, ECG, sEMG, and EOG) captured from diverse sensors contain noise, and some signals are not relevant to the five-stage classification of Hypo-V states. As a result, it is necessary to suppress noisy signals and retain only distinctive, informative ones. To solve these problems, the shearlet wavelet transform (SWT) [
The behavioral information (PERCLOS70-80-90%, mouth aspect ratio (MAR), eye aspect ratio (EAR), blinking frequency (BF), head-tilt ratio (HT-R)) is collected from three cameras deployed at different angles (0°, 45°, and 90°) around the driver to obtain multiview features. These cameras are used both in real vehicles and in simulators to assess the performance of the proposed Hypo-Driver system. Since the driver's face moves in different directions during driving, we deployed a multiview camera [
3-D cameras are used to detect facial features such as MAR, EAR, BF, and HT-R through computer vision techniques. These visual features include the percentage of eye-closure time, eye-closure time, blink frequency, yawn frequency, and nodding frequency from the best-viewed 3-D face image. Several researchers use PERCLOS measures to indicate the level of driver drowsiness; the mouth state, eye state, and yawning condition are widely used criteria for extracting facial features. For each viewpoint camera, we compute 25 features from every frame in the video recording dataset at 30 fps. To extract visual features, we apply 6 windows of size 25 × 25 with 50% overlap and compute statistical features on each. For each window, we extract the minimum, maximum, average, variance, skewness, and kurtosis. At the end of this process, the windows are summarized into a 25 × 6 = 150-dimensional feature vector. In addition, the convolutional neural network (CNN) classifier [
A CNN is a type of DNN used mainly for analyzing visual imagery. The model consists of two essential parts: a feature extractor and a classifier. In the feature extractor, every layer takes the output of the preceding layer as input and sends its own output to the next layer. Each layer extracts different features from the image by applying the convolution operation to its input; as the network gets deeper, it extracts higher-level features. The classifier part of the network consists of a fully connected layer that calculates each class's score from the extracted features and classifies images according to these scores. CNNs typically have fully connected layers at the end, which compute the final outputs. To extract behavioral features, we use a CNN architecture based on convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and fully connected layers. The features are computed from each of the 6 windows on every frame. The convolutional stage is likewise used to extract the statistical visual features mentioned above: minimum and maximum values, average, variance, skewness, and kurtosis.
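The six per-window statistics can be sketched as follows; the feature values and window layout are placeholders, illustrating only how the 25 × 6 = 150-dimensional vector arises:

```python
import numpy as np

def window_stats(window):
    """Six statistics per window: min, max, mean, variance, skewness, kurtosis."""
    w = np.asarray(window, dtype=float).ravel()
    mu, sd = w.mean(), w.std()
    skew = np.mean(((w - mu) / sd) ** 3) if sd > 0 else 0.0
    kurt = np.mean(((w - mu) / sd) ** 4) - 3.0 if sd > 0 else 0.0
    return np.array([w.min(), w.max(), mu, w.var(), skew, kurt])

# 25 hypothetical per-frame visual features (EAR, MAR, BF, HT-R, PERCLOS, ...),
# each tracked over a sample window; six statistics per feature give 25 x 6 = 150.
rng = np.random.default_rng(1)
frame_feats = rng.normal(size=(25, 25))  # 25 features x 25-sample window
vector = np.concatenate([window_stats(row) for row in frame_feats])
assert vector.shape == (150,)
```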
Physiological signals (EEG, ECG, sEMG, and EOG) are captured from diverse sensors and contain noise; moreover, some signals are not relevant to the five-stage classification of Hypo-V states. As a result, it is desirable to suppress noisy signals and use only distinctive, informative ones. To solve these problems, the shearlet wavelet transform (SWT) is used; the pre-processing step is explained in Section 3.2. Feature selection for the physiological features is performed using a recurrent neural network and long short-term memory (RNN-LSTM) model [
The RNN-LSTM model has been used in many studies to extract features from physiological signals; therefore, we use this deep learning model to train the network and extract effective features. In practice, an RNN-LSTM is a type of network that exploits time-series or sequential data to solve temporal problems. RNN configurations come in several forms, such as one-to-one, one-to-many, many-to-one, and many-to-many. Variants of the RNN architecture include bidirectional recurrent neural networks (BRNN), which draw on preceding inputs to make forecasts about the present state and additionally pull in future data to improve accuracy. Long short-term memory (LSTM), a variant of the RNN, addresses the problem of long-term dependencies: if the information influencing the current prediction lies far in the past, a plain RNN cannot retain it. RNNs are made up of recurrent structures that feed activations back locally, eliminating the need for external registers or memory to store past outputs; thanks to these recurrent structures, the LSTM has low computational complexity. In this paper, the LSTM is integrated with pooling and fully connected layers to process the physiological features.
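To make the LSTM gating mechanism concrete, a single step can be written out in plain NumPy; the dimensions, weights, and input sequence below are illustrative placeholders, not the model's actual configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,). Gate order: i, f, o, g."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate cell state
    c_new = f * c + i * g      # cell state carries long-term memory
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Run over a hypothetical physiological sequence: D=4 channels, T=50 steps, H=8 units.
rng = np.random.default_rng(3)
D, T, H = 4, 50, 8
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
assert h.shape == (H,)
```

The final hidden state `h` is the kind of summary vector that the pooling and fully connected layers then consume.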
As shown in
The multimodal feature fusion approach is used in this paper to detect multistage driver fatigue and distraction levels. Two types of fusion have been used in the past, called late and early fusion; compared to early fusion, the late-fusion approach provides better classification performance. Two fully connected (FC) layers contain the statistical behavioral and physiological features, respectively, and the FC layers of the two modalities are sampled every three seconds over a five-second interval. From the physiological and behavioral modalities, we thus obtain two feature vectors. In the late-fusion method, we classify each modality separately and then combine the probabilities of the two models to produce the final prediction through the dense architecture described in Section 3.6.
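A minimal sketch of late fusion over per-modality softmax scores, using the weighted-sum and product rules; the score vectors over the five classes are hypothetical:

```python
import numpy as np

def weighted_sum_fusion(p_visual, p_physio, alpha=0.5):
    """Weighted sum of the two modalities' softmax score vectors."""
    return alpha * p_visual + (1.0 - alpha) * p_physio

def product_fusion(p_visual, p_physio):
    """Element-wise product of softmax scores, renormalized to sum to 1."""
    p = p_visual * p_physio
    return p / p.sum()

# Five classes: normal, fatigue, visual inattention, cognitive inattention, drowsy.
p_vis = np.array([0.10, 0.15, 0.50, 0.15, 0.10])   # hypothetical visual scores
p_phy = np.array([0.05, 0.10, 0.60, 0.15, 0.10])   # hypothetical physiological scores
pred = int(np.argmax(weighted_sum_fusion(p_vis, p_phy)))
assert pred == 2  # both modalities agree on "visual inattention"
```

The weight `alpha` trades off trust between the two modalities; `alpha = 0.5` reduces to a plain average of the class probabilities.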
To perform late fusion, we aggregate the predicted class probabilities supplied by each modality model and average the class probabilities produced by the visual classifier over every three consecutive samples. The label with the highest final value is assigned. The representation is then standardized with conventional normalization and fed into the classification algorithm. In this study, we investigate late fusion of the two modalities in the hope of improving detection of the driver's hypovigilance state from fused data. Late multimodal fusion is investigated by combining the softmax scores produced by each FC layer with two techniques: the product and the weighted sum. Since the fusion operates on the classification scores, these techniques are referred to as late fusion. We calculate the softmax scores (
The parameter
In this paper, deep residual neural network (DRNN) models are used. In practice, the DRNN model is much easier to optimize than other deep-learning architectures, and this optimization efficiency helps achieve the high accuracy sometimes needed for real-time processing. Compared to a convolutional neural network (CNN), DRNN models are easier to optimize. In [
Many authors note that the DRNN model differs in network architecture from the convolutional neural network (CNN) model when used for feature selection and classification tasks. A typical CNN model consists of different layers, such as convolutional, feature map, pooling, and output layers. In the DRNN model, by contrast, each building block has a shortcut pathway directly connecting its input to its output, so the input is added to the block's output. Accordingly, the DRNN model is selected in this paper to extract and classify visual and non-visual features without complex image processing techniques. A visual example of the DRNN architecture is shown in
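The identity shortcut in a building block can be sketched as follows; the two-layer residual mapping, weights, and dimensions are illustrative rather than the exact DRNN configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Identity-shortcut building block: output = relu(F(x) + x),
    where F is two weight layers with a ReLU in between."""
    fx = W2 @ relu(W1 @ x)     # residual mapping F(x)
    return relu(fx + x)        # shortcut adds the input back

rng = np.random.default_rng(4)
d = 16
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1
y = residual_block(x, W1, W2)
assert y.shape == (d,)
```

Because the block only has to learn the residual F(x) rather than the full mapping, gradients flow through the shortcut even when F is near zero, which is what makes these networks easier to optimize.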
This can be recovered from the original input through learning and represented as:
In practice, this DRNN network has the capability, compared to other deep-learning models, to learn residual sub-networks that define discriminative features for recognition tasks. In most studies, the authors use 8 × 8, 16 × 16, 32 × 32, or 64 × 64 convolutional filter banks with kernels of variable size (7 × 7, 8 × 8, and 9 × 9) to convolve the input image and generate feature maps for the following layers. Following the convolutional layer, there are a ReLU activation layer, a maximum pooling layer, a batch normalization layer, a dropout layer, and a fully connected output layer. As a result, all layers in the network can capture the spatial relationships among pixels even in the presence of noisy pixels. The main purpose of these layers is to extract statistical properties of the different objects in an image. Therefore, this DRNN model is used in this paper to classify the five stages of the drivers' hypovigilance state. These layers are briefly explained in the subsequent paragraphs. To construct the DRNN model, the first and most important step is to create the main branch of the network by constructing a convolutional layer; the main branch is built from different sections. In this paper, three different sizes of convolutional filters are used (32 × 32, 16 × 16, and 8 × 8). The pre-trained feature map is calculated by:
After the ReLU layer, many authors use a pooling down-sampling layer (PO-layer) to identify the most prominent features relative to the other pixels. A common choice is a max-pooling function with a filter and stride, usually of size 2 × 2 pixels; however, other techniques such as average pooling and L2-norm pooling can also be used to construct this PO-layer. The intuition behind this layer is that once a particular feature is detected in the input window (the filter convolution yields a high value), its location relative to other features matters more than its exact position. Overfitting occurs when a deep learning model is so heavily tuned to the training set that it cannot generalize well to the validation and test sets; this issue is addressed with a dropout (DRP) layer. The ReLU achieves a nonlinear transformation by forcing negative features to zero. It is expressed by:
where f(i, j) and R(i, j) are the input and output feature maps of the ReLU, respectively. The derivative of the ReLU is expressed by
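Written out in terms of these feature maps, the ReLU and its derivative take the standard form:

```latex
R(i,j) = \max\bigl(0,\, f(i,j)\bigr),
\qquad
\frac{\partial R(i,j)}{\partial f(i,j)} =
\begin{cases}
1, & f(i,j) > 0,\\[2pt]
0, & f(i,j) \le 0.
\end{cases}
```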
These components of the DRNN model ease the training process compared to a conventional convolutional neural network (CNN) architecture. In a conventional network, the gradient of the error with respect to the weights and biases must be back-propagated layer by layer, each layer depending on the previous one, which creates an optimization problem when many features are selected. In the DRNN model, however, the network can go deeper because the gradients back-propagate easily through the shortcut connections. Moreover, batch normalization is added to the model to speed up training and lessen the sensitivity to the initial weight values; the training phase provides the initial weights, and the normalization layer stabilizes the learning process. The final fully connected output layer is expressed by:
The Hypo-Driver system has been implemented and tested on an Intel® Core i7-8600U processor with 16 GB RAM and an NVIDIA GPU with 2 GB of memory. An open-source environment (Python 3.6) was used to develop the deep learning (DL) models. In this paper, we integrate the DL models CNN, RNN-LSTM, and deep residual neural network (DRNN). The multimodal behavioral and physiological features are extracted through 3-D cameras and various sensors mounted on an Arduino board. All camera and sensor wires connect directly to the Arduino board, and the Arduino serial port connects to the system. To acquire EEG signals, we use the NeuroSky ThinkGear ASIC module (TGAM), connected via Bluetooth. To analyze heart rate variability (HRV), the Python serial (PySerial) library is imported for communicating with the Arduino board. All DL models are built on the TensorFlow and Keras platforms and are imported into the Jupyter environment to keep the code well documented. The training process uses the Hypo-DB dataset described in Section 3.1. For comparison with other state-of-the-art hypovigilance detection systems, four studies are selected: Du-RNN [
The parameter setup used for comparisons with other methods is explained in the subsequent paragraphs. In Du-RNN [
The comparison is performed on the preprocessed physiological and behavioral features of twenty drivers, as described in Section 3.2. The dataset is divided into 30% for testing, with the remainder used for training; we then hold out part of the training data, assigning 15% of the total to validation. Over the 20 drivers' datasets, 55% therefore serves as the training set, 15% as the validation set, and 30% as the test set. Training runs for 100 epochs with 10-fold cross-validation.
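The split above can be sketched as follows, using a hypothetical total sample count in place of the real dataset size:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000                      # hypothetical number of samples across 20 drivers
idx = rng.permutation(n)      # shuffle before splitting

n_test = int(0.30 * n)        # 30% held out for testing
test_idx = idx[:n_test]
remain = idx[n_test:]

n_val = int(0.15 * n)         # 15% of the total assigned to validation
val_idx = remain[:n_val]
train_idx = remain[n_val:]    # remaining 55% for training

assert len(train_idx) + len(val_idx) + len(test_idx) == n
print(len(train_idx) / n, len(val_idx) / n, len(test_idx) / n)  # → 0.55 0.15 0.3
```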
Like other authors, we use a CNN model in Hypo-Driver. A convolutional neural network (CNN) has a multilayer architecture consisting of convolution, nonlinearity, pooling, and finally softmax-connected layers. This multilayer CNN is used to detect driver drowsiness and distraction levels and has been employed in many driver-fatigue studies to detect and predict features extracted from drivers; those features are broadly classified into visual and non-visual features. For the comparisons, we set up the CNN model with the following parameters. The input layer holds features extracted from the PERCLOS measure, defined on driver images of size 256 × 256 pixels; it also contains data taken from the EEG sensors. The original sensor signals are converted into feature space by the CNN, with the sensor values transformed into a 3 × 128 feature matrix. A convolutional layer is then added, in which the window size per image and the size of each neuron determine the receptive area. We use three convolution kernels (C1, C2, C3) with filter sizes of 5 × 5, 5 × 5, and 3 × 3, respectively. We also use pooling layers to map features using average and maximum values. To avoid overfitting, we reduce the dimensionality of the convolution layers experimentally; the pooling layers are set to sizes 2 × 2, 2 × 2, and 3 × 3 in this CNN model. The parameter settings used in the RNN-LSTM and DRNN architectures are: optimizer: Adam, learning rate: 0.001, dropout rate: 0.2, loss function: categorical cross-entropy, batch size: 10, epochs: 100.
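For reference, these settings can be gathered into a single configuration object; this is only a sketch mirroring the values listed above, not code from the authors' implementation:

```python
# Training hyperparameters and CNN layer sizes, collected in one place
# so every model variant (CNN, RNN-LSTM, DRNN) trains under the same regime.
CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "dropout_rate": 0.2,
    "loss": "categorical_crossentropy",
    "batch_size": 10,
    "epochs": 100,
    "conv_kernels": [(5, 5), (5, 5), (3, 3)],   # C1, C2, C3 filter sizes
    "pool_sizes": [(2, 2), (2, 2), (3, 3)],
}
assert len(CONFIG["conv_kernels"]) == len(CONFIG["pool_sizes"]) == 3
```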
Besides deep-learning algorithms, machine-learning algorithms have also been developed in the past for multimodal hypovigilance detection, such as Chen-SBL [
Despite the tendency to learn from the training data, the loss remains very high for most parameter combinations, and the abrupt decrease in the loss for two of these combinations is merely a sign of over-training rather than the mark of a reliable model. More reliable CNN results can be obtained with a larger number of parameters.
Several different sensors have been used in the past to predict levels of drowsiness and distraction; however, those systems focus mainly on extracting physiological and behavioral features. In addition to these features, some authors also use environmental and vehicular parameters to detect driver activity, but physiological and behavioral features are easier to fuse than those other types of parameters. Accordingly, we use physiological and behavioral features to detect driver drowsiness and distraction levels. Compared to other systems, extracting the visual features that define the PERCLOS measure is challenging due to factors such as nighttime driving, the head not being center-aligned, and occlusion of faces, especially in female drivers; for these reasons, defining behavioral features is difficult. To address these issues, we use multiview cameras and integrate different sensors to define behavioral features with the CNN, RNN-LSTM, and DRNN deep learning models in real-time and simulator environments. The five-stage classification is carried out through the DRNN model, to which we add pre-trained layers and a dropout layer. The obtained results indicate that the Hypo-Driver system outperforms other state-of-the-art systems.
In future studies, we intend to integrate technologies such as cloud computing and GPU-based processing to enhance the computational power of the Hypo-Driver system. Nowadays, IoT-based applications, also called ubiquitous sensing, are taking center stage over the traditional paradigm, and the evolution of IoT necessitates expanding the cloud horizon to deal with emerging challenges. In future research, we will review the emerging cloud-based services, useful in the IoT paradigm, that support effective data analytics for detecting driver fatigue and distraction levels.
A multiview driver fatigue, inattention, and distraction level detection system, known as Hypo-Driver and based on multimodal deep features, is developed in this paper. Three cameras are deployed at multiple viewpoints around the driver to obtain behavioral features in uncontrolled conditions. The physiological signals (EEG, ECG, sEMG, and EOG) and behavioral information (PERCLOS70-80-90%, mouth aspect ratio (MAR), eye aspect ratio (EAR), blinking frequency (BF), head-tilt ratio (HT-R)) are collected and pre-processed to develop the Hypo-Driver system. The system trains behavioral features with a convolutional neural network (CNN), while the RNN-LSTM model extracts physiological features. After fusing these features, the Hypo-Driver system classifies Hypo-V into five stages using trained layers and a dropout layer in the deep residual neural network (DRNN) model. To test the performance of the Hypo-Driver system, data from 20 drivers are collected, and the results of Hypo-Driver are compared with state-of-the-art methods. On average, the Hypo-Driver system achieves an ACC of 96.5%. The obtained results indicate that the Hypo-Driver system outperforms other state-of-the-art driver Hypo-V systems on multimodal features.
The authors extend their appreciation to the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University for funding this work through Research Group no. RG-21-07-01.