Facial Expression Recognition (FER) has been an important field of research for several decades. Extracting emotional characteristics is crucial to FER but complex, as expressions exhibit significant intra-class variance, and facial characteristics in static pictures have not been fully explored. Previous studies used Convolutional Neural Networks (CNNs) based on transfer learning and hyperparameter optimization for static facial emotion recognition; Particle Swarm Optimizations (PSOs) have also been used for tuning hyperparameters. However, these methods achieve only about 92% accuracy. The existing algorithms thus have issues with FER accuracy and precision, and overall FER performance is degraded significantly. To address this issue, this work proposes a combination of CNNs and Long Short-Term Memories (LSTMs), called the HCNN-LSTMs (Hybrid CNNs and LSTMs) approach, for FER. The work is evaluated on the benchmark Facial Expression Recog Image Ver (FERC) dataset. The Viola-Jones (VJ) algorithm detects faces in preprocessed images, followed by HCNN-LSTMs feature extraction and FER classification. Further, the success rate of Deep Learning Techniques (DLTs) increases with the tuning of hyperparameters such as epochs, batch sizes, initial learning rates, regularization parameters, shuffling types, and momentum. This work uses Improved Weight-based Whale Optimization Algorithms (IWWOAs) to select near-optimal settings for these parameters using best fitness values. The experimental findings demonstrate that the proposed HCNN-LSTMs system outperforms existing methods.
FER in humans has developed into a significant subject of study over the past two decades. Humans transmit their emotions and intentions through the face, which is the most powerful, natural, and immediate means of assessing emotion. Individuals use facial expressions and verbal tones to infer emotions such as joy, sadness, and anger, and computer vision systems recognize facial expressions that reflect anger, contempt, fear, pleasure, sadness, and surprise. Many studies [
Detection, feature extraction, and facial expression categorization are the basic phases of automated FER. Studies have used several feature extraction approaches, including the Scale-Invariant Feature Transform (SIFT) [
Despite the use of DLTs in FER, difficulties persist, as they need a substantial quantity of training data to avoid overfitting [
The study in [
Zhang et al. [
Zeng et al. [
FaceNet2ExpNet presented by Ding et al. [
Li et al. [
Sun et al. [
Alenazy et al. [
Ryu et al. [
This study proposes the HCNN-LSTMs approach for FERs where the proposed scheme is divided into four stages: pre-processing, facial detections, feature extractions, and classifications.
In this work, the FERC dataset is taken as the input for facial expression recognition. It contains two directories, train and test, along with file labels. Both directories contain seven classes: anger, disgust, fear, happiness, neutral, sadness, and surprise.
Pre-processing facial pictures is a crucial step in FER. Image noise can be removed using filters, and this study uses Gaussian Filters (GFs) to smooth the noise present in facial images. GF usage is based on a 2-D convolution smoothing operator where each image pixel value is replaced by a weighted mean computed from the pixel's neighborhood; the weight decreases as a neighbor's distance from the centre increases. GFs are similar to mean filters but use a different kernel that reflects a Gaussian ('bell-shaped') hump:

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))

where σ is the standard deviation and (x, y) are the local coordinates of an image.
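As a sketch of this smoothing step (the kernel size is not specified in the text, so a small 5×5 window is assumed here), the Gaussian-weighted neighborhood mean can be implemented as:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2-D Gaussian kernel G(x, y)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()          # normalize so the weighted mean is unbiased

def gaussian_smooth(image, size=5, sigma=1.0):
    """Replace each pixel by the Gaussian-weighted mean of its neighborhood."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")   # replicate borders
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * k)
    return out
```

Because the kernel is normalized, smoothing a constant image leaves it unchanged, which is a quick sanity check for any filter implementation.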
Faces are detected in images using the VJ algorithm in this research work. This technique is used as it achieves high detection rates on real-time data. The VJ algorithm can be divided into four stages: Haar feature selection, integral images, AdaBoost training, and cascading classifiers.
Haar features are visual characteristics used for object detection. The VJ algorithm adapts Haar wavelet concepts to generate Haar-like features. Most human faces share a few common characteristics: for example, the eye regions in images are darker than their neighboring pixels, while the nose region looks brighter than the eyes. Haar features assess these patterns using two or three adjacent rectangles to determine the existence of a face in an image. Each Haar feature's value is computed by summing the pixels within each rectangle and taking the difference between the rectangular regions.
Integral images are a kind of image representation computed in the pre-processing phase. The VJ algorithm starts by converting input facial images into integral images, which is accomplished by setting each location equal to the total sum of all pixels above and to the left of it. Integral images can be computed using

ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)

where ii(x, y) is the integral image and i(x′, y′) is the original image.

Integral images make sums over rectangular regions cheap: the sum of all pixels within a rectangular region with corner points Z = [A1, A2, A3, A4] (top-left, top-right, bottom-left, bottom-right) can be computed using

Sum(Z) = ii(A4) + ii(A1) − ii(A2) − ii(A3)
Features can therefore be evaluated quickly, since the sum of the pixels in each component rectangle is computed in constant time. VJ evaluates these values with a detector whose base resolution is 24×24 pixels, which produces good results.
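A minimal sketch of the integral-image computation and the constant-time rectangle sum (function names are illustrative; a zero row/column is padded so no bounds checks are needed):

```python
import numpy as np

def integral_image(img):
    """ii(x, y): cumulative sum of all pixels above and to the left,
    with a zero-padded first row/column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle in O(1) via four corner lookups."""
    return (ii[top + height, left + width] + ii[top, left]
            - ii[top, left + width] - ii[top + height, left])
```

Each Haar feature is then just a difference of two or three such `rect_sum` calls, which is why the detector can scan many sub-windows cheaply.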
AdaBoost is a Machine Learning Technique (MLT) that builds a strong hypothesis by combining multiple weak hypotheses of average accuracy. AdaBoost is used in this work as it aids in selecting only the best features from the pool of features. Once the required features are established, a weighted combination of them determines whether a window contains a face. The individual features are therefore often called weak classifiers.
Using a cascade of increasingly sophisticated classifiers produces higher detection rates. The VJ face detection algorithm works by scanning the same image with detectors of different sizes, and the vast majority of evaluated sub-windows do not contain a face. Cascade classifiers are made of layers, each containing a strong classifier, and each level is responsible for determining whether a particular sub-window is a face or non-face; sub-windows rejected at an early layer are discarded immediately.
This research work uses HCNN-LSTMs for feature extraction and facial expression categorization. The CNNs used in the study have training options that directly impact performance. The essential training parameters include max epochs, mini-batch sizes, initial learning rates, regularization parameters, shuffle types, and momentum; the proposed IWWOAs select near-optimal settings for these parameters.
CNN is one of the most powerful DLTs: it stacks several hidden layers, convolves them, and sub-samples to extract low- and high-level features from input data. A CNN typically has convolution, sub-sampling/pooling, and fully connected layers as depicted in
This work uses CNNs for two main tasks: feature extraction and classification. Multiple convolution layers, followed by max-pooling and activation functions, perform the feature extraction, while the classifier is typically made up of fully connected layers.
The convolution layer is a key component of the CNNs in this study and is used for feature extraction. It includes linear and non-linear operations, namely convolution and activation functions. In this layer, the input features are convolved with a kernel (a convolution filter matrix), producing n output features identified from the input images. The outputs generated by convolving the kernel with the inputs are referred to as feature maps of size i×i.
A CNN has many convolution layers, and the output features of one layer are the inputs to subsequent convolution layers. Each convolution layer has n filters convolved with its inputs, resulting in n feature maps, equal to the number of filters used in the convolutions. It should be noted that each feature map is regarded as a distinct feature at a given position in the inputs. The output of the j-th filter of a convolution layer can be written as

F_j = f( Σ_i I_i * K_{i,j} + B_j )

where I_i is the i-th input map, K_{i,j} is the convolution kernel connecting input i to output j, B_j is a bias term, and * denotes convolution. f(·) is the activation function, here the Rectified Linear Unit (ReLU):

f(x) = max(0, x)
ReLUs are used in DLTs as they are cheap to compute and reduce unwanted interactions and nonlinear effects. When the input is negative, a ReLU outputs 0, while positive values pass through unchanged. The main advantage of this activation function is that it permits faster training: with saturating activations, the error derivative becomes extremely tiny in the saturating region, so weight updates virtually disappear, which is referred to as the vanishing gradient problem. ReLU does not saturate for positive inputs and therefore avoids it.
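The convolution and ReLU operations described above can be sketched as follows (a single-channel, stride-1, 'valid' convolution; the paper's actual layer sizes and filter counts are not specified):

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x): negative inputs become 0, positives pass through."""
    return np.maximum(0.0, x)

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of one input map with one kernel,
    producing one feature map (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    flipped = kernel[::-1, ::-1]       # true convolution flips the kernel
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

def conv_layer(image, kernels, biases):
    """One layer: F_j = relu(image * K_j + B_j) for each filter j."""
    return [relu(conv2d_valid(image, k) + b) for k, b in zip(kernels, biases)]
```

Each filter yields one feature map, so a layer with n filters produces n maps, matching the description above.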
Reducing the dimensionality of the input feature maps produced by the preceding convolutions is the primary goal of this layer. The sub-sampling process is carried out between masks and feature maps, where a mask of size b×b is chosen. The sub-sampling procedure can be depicted mathematically as

x_j = f( β_j · down(t_j) + b_j )

where down(·) is a sub-sampling function that sums each individual n-by-n block of the input t, resulting in an output that is n times smaller in each spatial dimension. Each output map has its own multiplicative bias β_j and additive bias b_j.
The features extracted by the convolution layers and down-sampled by the pooling layers are translated into the network's final output by the fully connected layers; in classification tasks, these layers produce a probability for each class. The Softmax activation function used in the output layer is detailed below:

softmax(z)_j = exp(z_j) / Σ_k exp(z_k)

where z_j is the input to output unit j and the sum in the denominator runs over all output units, so the outputs are non-negative and sum to 1.
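The Softmax function used in the output layer can be sketched as (with the standard max-subtraction trick for numerical stability, which does not change the result):

```python
import numpy as np

def softmax(z):
    """softmax(z)_j = exp(z_j) / sum_k exp(z_k)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # subtract max to avoid overflow
    return e / e.sum()
```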
LSTMs are a type of Recurrent Neural Network (RNN), but they model temporal sequences and the long-term dependencies within them better than standard RNNs.
The input gate of the LSTM is defined as

i_t = σ(W_xi·x_t + W_hi·h_(t−1) + W_ci·c_(t−1) + b_i)

The forget gate is defined as

f_t = σ(W_xf·x_t + W_hf·h_(t−1) + W_cf·c_(t−1) + b_f)

The cell state is defined as

c_t = f_t ⊙ c_(t−1) + i_t ⊙ tanh(W_xc·x_t + W_hc·h_(t−1) + b_c)

The output gate is defined as

o_t = σ(W_xo·x_t + W_ho·h_(t−1) + W_co·c_t + b_o)

Finally, the hidden state is computed as

h_t = o_t ⊙ tanh(c_t)

where tanh is the hyperbolic tangent activation function, x_t is the input at time t, and the W and b terms are the network parameters (weights and biases). σ is the logistic sigmoid function, and i, f, o, and c are respectively the input gate, forget gate, output gate, and cell state. W_ci, W_cf, and W_co denote the weight matrices for the peephole connections. In an LSTM, the three gates (i, f, o) control the information flow: the input gate determines how much of the new input influences the cell state, the forget gate determines how much of the previous cell state is retained, and the output gate determines how much of the cell state is exposed in the hidden state.
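A single peephole-LSTM step following these gate equations can be sketched in NumPy as below (diagonal peephole weights are assumed, as is common, and stored as vectors; parameter names are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, the gate nonlinearity."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: p is a dict of weights; Wci, Wcf, Wco are the
    (diagonal) peephole weights, applied elementwise to the cell state."""
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)
    return h, c
```

Running this step over a sequence of per-frame CNN features is exactly how the hybrid model consumes the activation maps described in the next section.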
The proposed HCNN-LSTMs approach takes a facial image as input and extracts features, followed by dimensionality reduction, using two CNN layers. The convolution layers use learnable filters to identify certain characteristics of the original image, with different filters convolved over it to produce a set of activation maps. These maps, after being reduced in spatial dimensionality, are fed into the LSTMs, which further acquire temporal features of the images for accurate recognition of facial expressions. The two fully connected layers of the CNN architecture, FC1 and FC2, categorize the facial emotion as happiness, disgust, fear, anger, neutral, sadness, or surprise based on the probabilities produced for each class.
Increasing the number of training samples directly affects the success rate of DLTs, and tuning hyperparameters such as max epochs, mini-batch sizes, initial learning rates, regularization parameters, shuffle types, and momentum improves a model's performance. The proposed IWWOAs select near-optimal settings for these parameters, the CNNs are trained on the adjusted hyperparameter values, and the testing set is then classified with the trained model to assess the effectiveness of the suggested technique.
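The text does not specify how a candidate solution is encoded, so the following is a purely illustrative sketch of how a whale's continuous position vector could be decoded into the hyperparameters listed above; all names and ranges are assumptions, not taken from the paper:

```python
# Hypothetical decoding of a candidate position vector (6 values in
# [0, 1]) into the six tuned hyperparameters; ranges are illustrative.
BATCH_SIZES = [16, 32, 64, 128]
SHUFFLE_TYPES = ["never", "once", "every-epoch"]

def decode_position(pos):
    """Map a 6-dimensional position in [0, 1]^6 to concrete settings."""
    return {
        "max_epochs": int(10 + pos[0] * 90),                  # 10 .. 100
        "batch_size": BATCH_SIZES[int(pos[1] * (len(BATCH_SIZES) - 1e-9))],
        "learning_rate": 10 ** (-4 + 3 * pos[2]),             # 1e-4 .. 1e-1
        "l2_regularization": 10 ** (-5 + 3 * pos[3]),         # 1e-5 .. 1e-2
        "shuffle": SHUFFLE_TYPES[int(pos[4] * (len(SHUFFLE_TYPES) - 1e-9))],
        "momentum": 0.5 + 0.49 * pos[5],                      # 0.5 .. 0.99
    }
```

The fitness of a position would then be the validation accuracy of an HCNN-LSTM trained with the decoded settings, which the optimizer in the next section maximizes.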
Mirjalili and Lewis presented the Whale Optimization Algorithm (WOA), a nature-inspired meta-heuristic that mimics the real-life behavior of whales. WOAs are swarm-based, modelling the social behavior of humpback whales and inspired by their bubble-net hunting tactics. These whales, the biggest group of baleen whales, stay together and hunt small groups of krill and small fish near the surface by blowing bubbles in a spiral pattern around their prey and then swimming up to the surface along this path. Their behavior can be represented numerically.
In the encircling phase of the basic WOA, whales update their positions toward the best solution found so far, X*:

D = |C · X*(t) − X(t)|
X(t+1) = X*(t) − A · D

To improve the performance of the WOAs, this work uses a time-varying inertia weight w(t) along with the update

X(t+1) = w(t) · X*(t) − A · D,   w(t) = w_max − (w_max − w_min) · t / t_max

where t is the current iteration, t_max is the maximum number of iterations, and w_max and w_min bound the inertia weight.

The vectors A and C are computed as

A = 2a · r − a,   C = 2r

where a decreases linearly from 2 to 0 over the course of the iterations and r is a random vector in [0, 1].

As discussed in the preceding section, humpback whales use the bubble-net technique to attack their prey. The following is a mathematical formulation of this spiral-updating approach:

X(t+1) = D′ · e^(bl) · cos(2πl) + X*(t)

where D′ = |X*(t) − X(t)| is the distance between the whale and the prey (the best solution obtained so far), b is a constant defining the shape of the logarithmic spiral, and l is a random number in [−1, 1].
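Since the paper does not give full IWWOA pseudocode, the following is a minimal illustrative sketch of a WOA loop with a time-varying inertia weight applied to the best-so-far position (b = 1 for the spiral; the pure random-exploration branch of the full WOA is omitted for brevity):

```python
import numpy as np

def iw_woa(fitness, dim, n_whales=20, t_max=100, w_max=0.9, w_min=0.4,
           lb=-10.0, ub=10.0, seed=0):
    """Minimize `fitness` over [lb, ub]^dim with an inertia-weighted WOA."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_whales, dim))
    scores = np.array([fitness(x) for x in X])
    best_i = int(scores.argmin())
    best, best_score = X[best_i].copy(), float(scores[best_i])
    for t in range(t_max):
        a = 2.0 * (1.0 - t / t_max)                  # decreases from 2 to 0
        w = w_max - (w_max - w_min) * t / t_max      # time-varying inertia
        for i in range(n_whales):
            if rng.random() < 0.5:                   # shrinking encircling
                r = rng.random(dim)
                A, C = 2.0 * a * r - a, 2.0 * rng.random(dim)
                D = np.abs(C * best - X[i])
                X[i] = w * best - A * D
            else:                                    # spiral bubble-net move
                l = rng.uniform(-1.0, 1.0)
                D = np.abs(best - X[i])
                X[i] = D * np.exp(l) * np.cos(2.0 * np.pi * l) + w * best
            X[i] = np.clip(X[i], lb, ub)
            s = fitness(X[i])
            if s < best_score:
                best, best_score = X[i].copy(), s
    return best, best_score
```

In the proposed system, `fitness` would wrap the training and validation of the HCNN-LSTMs under the decoded hyperparameters; here any test function (e.g. the sphere function) exercises the loop.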
The main steps of the proposed HCNN-LSTMs are described as follows:
1: Prepare the FERC dataset
2: Process face alignments
3: Convert RGB images to grayscale
4: Apply GFs for reducing noise
5: Perform facial detections
6: Split data into training and testing sets
7: Apply HCNN-LSTMs
8: Set initial values for the IWWOA control parameters
9: Set initial values for the whale population
10: Set initial values for the positions of the search agents
11: Set initial values for the fitness function
12: Execute until the stop criterion of the IWWOA is achieved
13: Perform the IWWOA algorithmic phases
14: Obtain near-optimal values for the hyperparameters as output
15: Train HCNN-LSTMs on the training set
16: Output the trained model
17: Evaluate the success of HCNN-LSTMs on the testing set
The proposed HCNN-LSTMs FER system was implemented in Python and tested on the FERC dataset found on
Initially, the input is taken from the FERC dataset and filtered using a Gaussian filter. The input and filtered images are represented in
Accuracy measures the weighted percentage of facial expressions that are correctly identified. It is represented as

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where
TP - True Positives
TN - True Negatives
FP - False Positives
FN - False Negatives

Precision is the ratio of correctly predicted positive results to the total predicted positive observations:

Precision = TP / (TP + FP)

Recall is the ratio of correctly predicted positive results to all observations in the actual class:

Recall = TP / (TP + FN)

F-measure represents the weighted harmonic mean of precision and recall:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
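The four metrics can be computed directly from the confusion counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```

For a multi-class task such as the seven expressions here, these counts are taken per class and the per-class metrics are then averaged (e.g. weighted by class support).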
| Metrics | LSTM | CNN | HCNN-LSTM |
|---|---|---|---|
| Accuracy | 0.55 | 0.66 | 0.77 |
| Precision | 0.62 | 0.63 | 0.78 |
| Recall | 0.55 | 0.66 | 0.77 |
| F-measure | 0.44 | 0.59 | 0.72 |
This research work has proposed and demonstrated an approach for FER called HCNN-LSTMs, a hybrid approach combining CNNs with LSTMs. GFs eliminate noise from the images, and the VJ algorithm detects faces. The study increased recognition rates through its use of the proposed HCNN-LSTMs for feature extraction and classification. CNN training uses the parameters max epochs, mini-batch sizes, initial learning rates, regularization parameters, shuffle types, and momentum, which the IWWOAs tuned to select near-optimal hyperparameter values. The experimental findings demonstrated the proposed system's superior performance, outperforming prior systems in accuracy, precision, recall, and F-measure.