Depression is a psychological disorder that may cause physical illness or even lead to death. It strongly affects a person's social and economic life; therefore, its effective and timely detection is essential. Besides speech and gait, facial expressions carry valuable clues to depression. This study proposes a depression detection system based on facial expression analysis. Facial features have been used for depression detection with a Support Vector Machine (SVM) and a Convolutional Neural Network (CNN). We extracted micro-expressions using the Facial Action Coding System (FACS), as Action Units (AUs) correlated with sadness, disgust, and contempt are relevant features for depression detection. A CNN-based model is also proposed in this study to automatically classify depressed subjects from images or videos in real time. Experiments were performed on a dataset obtained from Bahawal Victoria Hospital, Bahawalpur, Pakistan, labeled according to the Patient Health Questionnaire depression scale (PHQ-8) for inferring the mental condition of a patient. The experiments revealed 99.9% validation accuracy with the proposed CNN model, while the extracted features obtained 100% accuracy with SVM. Moreover, the results demonstrated the superiority of the reported approach over state-of-the-art methods.
Mental disorders are severe psychological conditions that cause patients to think, perceive, and behave abnormally. When the disease progresses to a severe stage, persons suffering from serious mental disorders struggle to stay connected to the real world and are generally unable to function well in everyday life. Depression, schizophrenia, bipolar disorder, various dementias, psychoses, and developmental disorders such as autism are examples of mental disorders. Depression is a common mental disorder and one of the leading causes of disability worldwide. Globally, about 350 million people suffer from depression [
Both clinical and non-clinical assessment strategies are used to diagnose depression. The clinical method, which is based on the severity of symptoms, comprises the PHQ-8 and relies only on patient reports or clinical decisions made in light of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). Recently, electroencephalography and eye tracking have also been used for depression detection. Symptom-based methods of depression assessment suffer from inherent limitations related to patient denial, poor sensitivity, and emotional predispositions [
All these hindrances make depression detection a labor-intensive task. Recently, artificial intelligence, computer vision, and machine learning approaches have been applied to facial appearance for depression detection. Computer vision can recognize facial micro-expressions that accurately indicate the severity level and type of depression. Computer-aided diagnostic systems developed by integrating these approaches allow anyone to be screened without hesitation or hindrance. These systems could save cost and time for both the patient and the psychiatrist. Moreover, privacy can be ensured, which is essential given the associated stigma. Hence, such systems can be effective, reliable, secure, non-invasive, robust, and real-time, and can facilitate early-stage detection.
Micro-expressions are transient and involuntary facial expressions, frequently occurring in high-stakes circumstances when individuals attempt to conceal their actual emotions. These actions are driven by the amygdala. They are too short (1/25 to 1/2 s) and subtle for the naked eye to perceive [
According to Darwin [
The paper is organized as follows: Section 2 reviews related studies, while Section 3 provides details of the data acquisition, pre-processing, and the proposed CNN model used for depression detection. Section 4 describes the experimental work and presents the results, their discussion, and a comparison with state-of-the-art methods. Conclusions and future directions are given in Section 5.
In the research article [
In the research [
The authors of this paper [
Furthermore, the system uses dynamic feature descriptors based on the volume local directional number (VLDN) to capture facial movements. To gain more discriminative features, the feature map obtained from VLDN is fed into a CNN. The model also gathers temporal information by incorporating the temporal median pooling (TMP) technique through a multilayer bidirectional long short-term memory (BI-LSTM) network. The TMP method was applied to temporal fragments of the spatial and temporal features. The experiments were conducted on the AVEC-2013 and AVEC-2014 datasets, yielding an RMSE of 8.93 and an MAE of 7.04.
The proposed methodology consists of several steps to classify depression from facial images.
The dataset, consisting of 3–5 min videos of each depressed patient, was obtained from the Department of Psychiatry, Bahawal Victoria Hospital, Bahawalpur. All patients provided informed consent for being video recorded during interviews based on the PHQ-8. There were 180 depressed patients, diagnosed in a clinical trial conducted by a team of psychiatrists, and 200 healthy controls (normal). During video acquisition, interviews were conducted based on a questionnaire prepared according to the PHQ-9 and DSM-5. The details of the patients and healthy controls are shown in
Group | No. | Gender | Age group (years) | Marital status | Occupation |
---|---|---|---|---|---|
Patient | 10 | Male | 22–39 | Unmarried | Job holder |
Patient | 20 | Male | 32–43 | Married | Job holder |
Patient | 20 | Male | 40–55 | Married | Job holder |
Patient | 10 | Male | 50–64 | Married | Job holder |
Patient | 40 | Female | 20–35 | Married | Housewife |
Patient | 30 | Female | 30–45 | Married | Job holder |
Patient | 20 | Female | 40–55 | Married | Housewife |
Patient | 20 | Female | 50–65 | Married | Housewife |
Patient | 10 | Female | 50–70 | Married | Housewife |
Normal | 70 | Male | 20–40 | Unmarried | Job holder |
Normal | 10 | Female | 20–55 | Unmarried | Job holder |
Normal | 20 | Male | 30–45 | Married | Job holder |
Normal | 30 | Female | 40–60 | Married | Housewife |
Normal | 40 | Male | 45–65 | Married | Job holder |
Normal | 30 | Female | 50–65 | Married | Housewife |
The recorded videos were converted to color frames of 256 × 256 pixels each. The average duration of each video is 3 to 5 min, at 24 frames per second. Approximately 1,036,800 depressed and 1,152,000 healthy-control images were obtained from these videos. This is a massive dataset in terms of the computation involved in its analysis. Therefore, a lightweight version of the dataset was also prepared by retaining one of every two consecutive frames. Since micro-expressions are transient and subtle movements, dropping more frames risks losing these expressions; for this reason, the volume could not be reduced further.
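As a simple illustration of this subsampling step, the following sketch (using OpenCV; the function name, resizing call, and output file layout are only illustrative assumptions, not the exact pipeline used in this study) keeps one of every two frames and stores them as 256 × 256 color images:

```python
import cv2

def subsample_video(video_path, out_dir):
    """Keep one of every two consecutive frames, resized to 256x256,
    and write them out as the lightweight version of the dataset."""
    cap = cv2.VideoCapture(video_path)
    kept, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % 2 == 0:  # retain every second frame
            frame = cv2.resize(frame, (256, 256))
            cv2.imwrite(f"{out_dir}/frame_{kept:06d}.png", frame)
            kept += 1
        index += 1
    cap.release()
    return kept
```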
To have noise-free information, the OpenFace tool [
A CNN is a deep neural network that operates on images and is applicable to a wide range of machine-vision problems. It analyzes an image and maps it into a form that can be processed further for better prediction without losing essential features. A CNN comprises two principal operations, i.e., convolution and pooling. The convolution operation works as a feature extraction layer, extracting features while preserving spatial information. The pooling layer reduces the dimensions of the extracted features according to the application. Pooling functions are categorized into max-pooling and average-pooling. The reduced features are then fed into a fully connected layer that applies an activation function.
The component that carries out the convolution in the convolutional layer is the kernel/filter. For a two-dimensional image, this component performs element-wise multiplication between the input data and a two-dimensional array of weights. The convolutional layer is responsible for extracting low-level features, including edges, color, and gradient orientation. More layers are added to obtain the high-level features comprising micro-expressions, from which the severity and type of depression can be identified. Consequently, a feature map is created for further processing. The convolutional process is shown in
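As a minimal illustration (plain NumPy, not the implementation used in this study), the following sketch slides a vertical-edge kernel over a toy image and sums the element-wise products at every position, producing a feature map that highlights the edge:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in CNN frameworks):
    slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 gray image with a bright right half; a Sobel-like kernel
# responds strongly along the vertical edge, i.e., a low-level feature.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
print(conv2d(image, sobel_x))
```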
In this layer, the spatial size of the convolved feature is reduced to lower the computational cost of processing the massive data. Moreover, it is necessary for extracting dominant features that are position or rotation invariant, which helps the model operate effectively. There are two types of pooling, i.e., max-pooling and average-pooling. In max-pooling, the maximum value of the portion of the image covered by the kernel is returned, while in average-pooling, the average of all values in the portion of the image covered by the kernel is returned. Max-pooling generally performs better than average-pooling; therefore, max-pooling has been used in the proposed architecture.
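A small NumPy illustration of the two pooling variants on an assumed toy feature map (not taken from the dataset); a 2 × 2 window with stride 2 halves each spatial dimension:

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2: each output value summarizes one
    non-overlapping window of the input feature map."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            out[i // 2, j // 2] = window.max() if mode == "max" else window.mean()
    return out

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 2, 1, 0],
                        [3, 8, 4, 6]], dtype=float)
print(pool2x2(feature_map, "max"))   # [[6. 5.] [8. 6.]]
print(pool2x2(feature_map, "avg"))   # [[3.5  2.5 ] [5.   2.75]]
```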
The activation function converts the data from linear to non-linear form, which is essential for solving complex problems. Common activation functions include tanh, sigmoid, softmax, the Rectified Linear Unit (ReLU), and the Exponential Linear Unit (ELU). During feature extraction by the proposed CNN model, the rectified linear function is used to introduce the non-linearity needed to learn complex relationships in the data. ReLU can minimize interaction effects because it returns 0 for negative values; it acts as a linear function on one half of the input range and is non-linear overall. It is primarily used in the hidden layers, although it can lead to dead neurons. The ReLU function is computationally efficient because its derivative is 0 for negative inputs and 1 for positive inputs, so only a subset of the neurons is activated at any time. Its function is shown in
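For reference, the ReLU function and its derivative can be written as

$$ f(x) = \max(0, x), \qquad f'(x) = \begin{cases} 0, & x < 0 \\ 1, & x > 0. \end{cases} $$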
The pooled feature map is flattened before being passed to the fully connected layer. The image is flattened into a column vector suitable for a multi-level perceptron and fed to the fully connected layer. Back-propagation is then applied at each training iteration, and the fully connected layer is responsible for learning a non-linear combination of high-level features. A fully connected layer has connections to every activation unit of the previous layer. Over a series of epochs, the model learns to distinguish dominant and recessive features using softmax classification, which normalizes the scores assigned to the feature vectors of each category into a probability distribution. Such normalization accelerates learning and is a prime factor for faster convergence, the state in which further training no longer improves the model.
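The softmax operation referred to above maps a vector of scores $\mathbf{z}$ for $K$ classes into a probability distribution:

$$ \mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K. $$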
The sigmoid is a feed-forward, non-linear activation function defined for real input values and used for predicting probability-based outputs; its value lies between 0 and 1. It is a smooth function that facilitates differentiation and is suitable for classification. Differentiation is needed in a neural network for calculating gradients; with a constant derivative, the optimal parameters could not be learned. The sigmoid suffers from the vanishing-gradient problem and is not zero-centered, so learning with it is slow and time-consuming. It is mainly used in the output layer and is better suited for binary classification. The sigmoid function is shown in
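For reference, the sigmoid function and its derivative, which is required for gradient calculation, are

$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr). $$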
This hyperparameter of the CNN architecture is used to prevent over-fitting, making the model more efficient and generalizable. Its value lies between 0.0 and 0.9. Nodes are temporarily omitted from the neural network, along with all their incoming and outgoing connections, according to a fixed probability, thereby thinning the network weights. In the proposed CNN architecture, the dropout value has been fixed empirically at 0.5.
In gradient descent, the batch size is a hyperparameter that refers to the number of samples processed in a forward/backward pass before the internal model parameters are updated. The batch size may be 32, 64, 128, 256, or 512. In the proposed CNN architecture, the batch size has been fixed at 128.
Momentum is used to overcome a noisy or bouncing gradient by accelerating the search in a consistent direction, building inertia (constant movement). Momentum is used for the optimal performance of gradient descent; it generally reduces the error and speeds up the learning algorithm. It prevents oscillation of the optimization process and helps the network escape local minima to find the global minimum. Momentum values are usually chosen close to 1, such as 0.9 or 0.99. In the proposed CNN architecture, the momentum has been fixed empirically at 0.9.
The learning rate controls how much the neural network's weights are adjusted with respect to the error gradient. It is responsible for stable and smooth training and is the most important hyperparameter of gradient descent. The learning rate affects both the accuracy and the time required to train the model, and its value lies between 0.0 and 1.0. In the proposed CNN architecture, a learning rate of 0.001 has been used.
The proposed lightweight CNN architecture is shown in
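As a rough sketch only (the exact 16-layer, 6.5-million-parameter configuration is given in the figure; the layer counts and filter sizes below are assumptions), a Keras model combining the elements described above with the reported hyperparameters (dropout 0.5, learning rate 0.001, momentum 0.9, batch size 128) could be written as:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(256, 256, 3), num_classes=2):
    """Illustrative lightweight CNN: Conv2D/ReLU blocks with max-pooling,
    dropout 0.5 before the classifier, and a softmax output layer."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),  # dropout fixed empirically at 0.5
        layers.Dense(num_classes, activation="softmax"),
    ])
    # SGD with the learning rate and momentum reported in the text.
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# model.fit(x_train, y_train, batch_size=128, validation_data=(x_val, y_val))
```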
To subsample the videos given to the system, we extracted frames by discarding one of every two consecutive frames. In total, 109,440 data frames (51,840 depressed and 57,600 normal) were obtained from 38 subjects' data. Preliminarily, experiments for depression detection on the whole frames were performed using pre-trained state-of-the-art CNN-based models, i.e., VGG16, VGG19, Alex-net, and Google-net, as well as the proposed CNN model, but the models became over-tuned due to the noise present in the frames. To remove this noise, the frames were then processed with the OpenFace tool to detect and extract aligned faces, 2D landmarks, AUs, gaze, pose, and landmark features.
The aligned faces have been used for depression detection with the proposed and state-of-the-art CNN models, i.e., VGG16, VGG19, Alex-net, and Google-net. The AU, gaze, pose, and landmark-based features have also been used to train a model using SVM. Moreover, all these features were combined to verify their applicability to depression detection.
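A minimal sketch of this feature-based classification, assuming scikit-learn and the 70/15/15 split described next (the RBF kernel stands in for the "fine Gaussian" preset, and the feature matrix here is a random placeholder rather than the actual OpenFace output):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: per-frame OpenFace features (6 pose + 288 gaze + 35 AU + 136 landmark
# values concatenated = 465 columns); y: 0 = normal, 1 = depressed.
# Random placeholders here; in practice these come from the OpenFace CSVs.
X = np.random.rand(1000, 465)
y = np.random.randint(0, 2, size=1000)

# 70% training, 15% validation, 15% testing, as in the experiments.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# RBF kernel as a stand-in for the "fine Gaussian" SVM; a cubic kernel
# would correspond to SVC(kernel="poly", degree=3).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
print("test accuracy:", clf.score(X_test, y_test))
```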
70% of the data is used for training, while the remaining 30% is divided equally between testing and validation. The results of all experiments are shown in
Classification through | Features | Model trained through | Parameters (millions) | Testing accuracy (%) | Validation accuracy (%) | Remarks |
---|---|---|---|---|---|---|
Features | Pose (6 features) | SVM with fine Gaussian kernel | - | 96.3 | 95.4 | Validated |
Features | Gaze (288 features) | SVM with fine Gaussian kernel | - | 83.2 | 82.1 | Validated |
Features | AUs (35 features) | SVM with cubic kernel | - | 91.2 | 90.0 | Validated |
Features | Landmark features (136 features) | SVM with fine Gaussian kernel | - | 99.9 | 98.5 | Validated |
Features | Combined (Pose + Gaze + AUs + Landmark) | SVM with cubic and fine Gaussian kernels | - | 100 | 100 | Validated |
Frames | “Noisy” frames of the dataset | VGG16 | 138 | 100 | 100 | Over-tuned |
Frames | “Noisy” frames of the dataset | VGG19 | 144 | 100 | 100 | Over-tuned |
Frames | “Noisy” frames of the dataset | Alex-net | 61 | 100 | 100 | Over-tuned |
Frames | “Noisy” frames of the dataset | Google-net | 7 | 100 | 100 | Over-tuned |
Frames | “Noisy” frames of the dataset | CNN-based model | 6.5 | 100 | 100 | Over-tuned |
Frames | Frames having aligned faces | VGG16 | 138 | 97.56 | 94.67 | Validated |
Frames | Frames having aligned faces | VGG19 | 144 | 99.0 | 96.0 | Validated |
Frames | Frames having aligned faces | Alex-net | 61 | 98.10 | 96.23 | Validated |
Frames | Frames having aligned faces | Google-net | 7 | 95.46 | 93.57 | Validated |
Frames | Frames having aligned faces | The proposed CNN model | 6.5 | | | Validated |
When the aligned facial images were classified through VGG16, VGG19, Alex-net, and Google-net, validation accuracies of 94.67%, 96%, 96.23%, and 93.57% were obtained, respectively. These results are reasonable; however, all these pre-trained models have a large number of learnable parameters, which significantly increases their training time. In contrast, the proposed model comprises only 16 layers and 6.5 million parameters. Therefore, it can be considered lightweight and requires less training time, while achieving 99% validation accuracy. The accuracy obtained through the proposed CNN model is significantly higher than that of the pre-trained models, which makes it more favorable for real-time depression detection from facial images. The accuracy and loss plots and the confusion matrix of the results are shown in
These results obtained through the proposed model are also compared with recent studies on depression detection from video or image data, as shown in
Related reference | Dataset used | Clinical methods | Methodology | Evaluation measures |
---|---|---|---|---|
[ | 57 depressed patients | Interviews were conducted using the Hamilton Rating Scale for Depression | ZFace + Stacked Denoising | Accuracy (78.67%) |
[ | AVEC-2013, AVEC-2014, Audio-Visual depressive language corpus | Depression Videos, BDI-II | Convolutional 3D network + RNN | MAE (7.22) |
[ | AVEC-2016 | Audio video based clinical trials | DCNN + PV-SVM + Histogram of Displacement Range | F1 (89%) |
[ | AVEC-2013 and AVEC-2014 | Clinical interviews/Questionnaires | Median Robust Local Binary Patterns from Three Orthogonal Planes + Dirichlet Process Fisher Vector | MAE (7.55) |
[ | AVEC-2013 and AVEC-2014 | Depression Videos | Spatio-Temporal Attention network + Multimodal Attention Feature Fusion + Eigen Evolution | MAE (6.14) |
[ | AVEC-2013 and AVEC-2014 | Depression Videos | Inception-ResNet-v2 network + CNN + Bi-LSTM + Temporal median pooling + Deep spatiotemporal network | MAE (7.04) |
[ | Locally developed | Depression Videos | Clinical methodology (self-rating depression scale) + 3DCNN | Accuracy (92.50%) |
The proposed model | Locally developed | Clinical interviews/Questionnaires | SVM + CNN | Accuracy |
In the research [
In another paper [
Moreover, the framework captures facial motion using VLDN-based dynamic feature descriptors. The feature map obtained from VLDN is fed into a CNN to obtain more discriminative features. It has been evaluated in terms of mean absolute error (7.04) and root mean square error (8.93), which are reasonable error measures. This research [
The proposed study employs a custom-designed CNN model, previously developed state-of-the-art CNN-based models, and SVM. Experiments have been performed in a variety of ways. When the whole, noisy dataset was classified through both the pre-trained models and the reported one, the models became over-tuned in both cases. Therefore, the dataset was processed with the OpenFace tool for face alignment and feature extraction. The aligned faces were then used to train the proposed model, which achieved 99% validation accuracy. Moreover, the extracted features were fed into the SVM model for training and testing, reaching 100% accuracy.
Depression has become a significant health disorder. Facial AUs provide major clues for the detection of depressive disorders. Many studies have been conducted to detect depression through these visual clues, but their models do not perform well in terms of evaluation measures. Therefore, we proposed a multimodal depression detection system based on CNN and SVM, trained on a locally developed dataset. The experiments revealed 99.9% validation accuracy with the proposed CNN model, while the extracted features obtained 100% accuracy with SVM. Moreover, the results demonstrated the superiority of the reported approach over state-of-the-art methods. The proposed lightweight CNN-based model, with fewer layers and parameters, is well tuned and provides robust performance. Therefore, this model is well suited for real-time depression detection from videos captured with digital cameras, even at low resolution.
In the future, we aim to extend this work to detect other psychological disorders in real-time environments from gait, spoken language, and body movement. Moreover, using electroencephalogram (EEG) sensors, we also aim to diagnose these disorders at an early stage.
The researchers would like to thank the Deanship of Scientific Research, Qassim University for funding the publication of this project.