Computer Systems Science & Engineering
Affective State Recognition Using Thermal-Based Imaging: A Survey
School of Computer Sciences, Universiti Sains Malaysia, Penang, 11800, Malaysia
*Corresponding Author: Ahmad S. A. Mohamed. Email: firstname.lastname@example.org
Received: 10 November 2020; Accepted: 20 December 2020
Abstract: Thermal-based imaging has recently attracted the attention of researchers interested in recognizing human affects because it can measure transient facial temperature, which is correlated with human affects, and because it is robust against illumination changes. Studies have therefore increasingly used thermal imaging as a potential and supplemental solution to the challenges of visual (RGB) imaging, such as variation in lighting conditions and the difficulty of revealing genuine human affect. Moreover, thermal-based imaging has shown promising results in detecting psychophysiological signals, such as pulse rate and respiration rate, in a contactless and noninvasive way. This paper presents a brief review of human affects and focuses on the advantages and challenges of the thermal imaging technique. In addition, it discusses the stages of thermal-based human affective state recognition, namely dataset type, preprocessing, region of interest (ROI), feature descriptors, and classification approaches, with a brief performance analysis based on a number of works in the literature. This analysis could help beginners in the thermal imaging and affective recognition domain to explore the numerous approaches that researchers have used to construct affective state systems based on thermal imaging.
Keywords: Thermal-based imaging; affective state recognition; spontaneous emotion; feature extraction and classification
1 Introduction
For a long time, human affects (emotions and facial expressions) have attracted great attention from researchers. For instance, in the 17th century, John Bulwer wrote his book, “Pathomyotomia or Dissection of the Significant Muscles of the Affections of the Mind,” in which he demonstrated the mechanism of the muscles behind facial expressions . Charles Darwin  illustrated the expressions of humans and animals and claimed that emotions are innate. Duchenne de Boulogne  described how muscles produce facial expressions and published pictures of facial expressions obtained from the electrical stimulation of human muscles. Ekman and Friesen  proposed a prototype comprising six basic human emotions that hold across different cultures. Russell  mapped the categories of emotion into a dimensional model of valence and arousal.
Recently, the recognition of human affective state has become a crucial aspect of different domains and applications owing to the sophistication of smart technologies and the growing power of processing units and artificial intelligence techniques. For example, understanding a person’s current state not only facilitates human-to-human communication but can also improve human-computer interaction (HCI). Another example is assessing the needs of patients who are unable to express their emotions, such as people with autism disorder . In human-robot interaction (HRI) , such a technique allows robots to understand and distinguish between negative and positive emotions . In security applications, the technique helps identify poker faces (people who have the ability to hide their emotions) . It is also useful in stress detection applications that aim to avoid serious consequences, such as cardiovascular diseases, cancer [10,11], and suicide .
Psychological science mainly describes human affects through three models: the categorical model, the dimensional model, and the appraisal model. In the categorical model, human affects are described as a number of distinct affective states called basic emotions, and various theories support this type of emotion description. For example, Mowrer  claims that only two unlearned basic emotions exist, namely pleasure and pain. Frijda  introduced six basic facial expressions that can be formed by a reading process. Ekman  introduced six basic emotions that can be recognized universally. In contrast, a dimensional model describes human affects along continuous dimensions. For example, the study of Russell  places human affects in a 2D space whose axes are valence (miserable to glad) and arousal (sleepy to aroused). The third model represents human affect as appraisal-based emotion. The appraisal approach is a psychological theory that categorizes human affects according to the evaluation (appraisal) of events, where the emotion is the specific reaction that this evaluation elicits . For example, Ortony and Clore  explain appraisal theories and how cognition shapes and influences human emotions.
Human affects are influenced by several factors, and they can be revealed through a set of human channels, such as voice, gesture, pose, gaze direction, and facial expressions . Psychology classifies human communication into verbal and nonverbal categories, of which nonverbal communication comprises signals such as kinesics, proxemics, haptics, physical appearance, and paralanguage. Kinesic behavior covers gestures, motion, eye gaze, posture, and facial expressions, while paralanguage covers the voice, volume, tone, pitch, and rate of a speaker [19–21]. Facial expressions account for about 55% of the effect of a spoken message, while the spoken words account for only 7% . Facial expressions form a primitive, nonverbal language in which each expression is a basic unit of meaning, much like words in spoken languages . Emotions, in turn, have a significant role in human affective states and represent the process by which events or objects stimulate conscious or unconscious awareness attached to a mental state . Because the face plays an important role in conveying human affects, Ekman and Friesen  introduced an anatomically based tool, the facial action coding system (FACS), to describe the movement of facial muscles during emotion; each muscle movement is described as an action unit (AU).
In addition to psychological behaviors, past studies have also focused on physiological signals in studying human affects. The human nervous system comprises two main parts, the central nervous system (CNS) and the peripheral nervous system (PNS). The PNS in turn consists of the autonomic nervous system (ANS)  and the somatic nervous system (SNS) . The ANS regulates numerous physiological processes, such as heart rate, breathing rate, digestion, blood pressure, body temperature, and metabolism. Consequently, human behaviors and affects are reflected in physiological signals [27,28]. Researchers have employed physiological sensors such as the electromyograph (EMG), electrocardiogram (ECG), electroencephalogram (EEG), galvanic skin response (GSR), and photoplethysmogram (PPG).
In recent decades, the literature has reflected a growing interest in constructing various types of automatic human affective state recognition modalities. These modalities can be categorized by the type of input source of the model. For example, a large number of studies have focused on visual-based (RGB imaging) facial emotion recognition, while other studies have selected thermal-based (infrared thermal imaging) facial emotion recognition. Several affective state recognition approaches based on voice, gestures, and physiological signals have also been proposed. Several reviews and surveys have been conducted in recent years. For multimodal affective state recognition, readers can refer to [29–35]; for the physiological-based domain, to [27,36–40]; for visual-based facial affective state recognition, to [41,42]; and for thermal-based facial affective state recognition, to [43–47]. This study focuses on thermal-based affective state recognition. Its goals are as follows: first, to introduce the important role of thermal imaging in affective state recognition; second, to review the databases used in thermal affective state recognition and their challenges; third, to discuss the general stages of thermal-based affective state recognition, namely preprocessing, region of interest (ROI), feature extraction, and classification; and finally, to present a brief performance analysis, based on numerous studies, of the feature descriptors, classification algorithms, and databases used in past work. The organization of this paper is as follows: Section 2 discusses thermal-based affective state recognition, while Section 2.1 reviews the types of thermal databases and their challenges. 
Section 2.2 presents the preprocessing stage from the perspective of the literature, and Section 2.3 discusses the facial regions of interest (ROIs) and their correlation with human affects. Section 2.4 discusses numerous feature extraction methods with a brief performance analysis. Section 2.5 reviews classification approaches with a brief performance analysis. Finally, Section 3 concludes this study.
2 Thermal-Based Affective State Recognition
Recently, studies have increasingly adopted thermal imaging in affective state recognition because of several factors. In the past, thermal camera technology was not practical because of its low resolution, high cost, heavy weight, and the need for a controlled environment with a stable ambient temperature . Advances in thermal devices have produced new categories of portable and flexible thermal sensors that are lightweight, low in cost, and high in resolution, such as mobile thermal sensors. These sophisticated sensors motivate researchers to explore thermal imaging both inside laboratories and in real-world environments for several applications, such as human stress recognition  and physiological monitoring of respiration rate and heart rate . Moreover, the Covid-19 pandemic has increased the popularity of thermal sensors for measuring people’s temperatures in a contactless and noninvasive way. Another factor that reveals the importance of thermal imaging is the nature of human affect. From the psychophysiological perspective, the ANS is responsible for regulating human physiological signals, such as heart rate, respiration rate, blood perfusion, and body temperature, during human affect. Hence, thermal imaging becomes a potential solution for measuring transient facial temperature . Past studies have also used thermal imaging to detect other vital signals that are correlated with human affect, such as respiration rate and pulse rate, which could overcome the challenges of contact-based and invasive physiological sensors [52–57]. When comparing thermal-based imaging with visual-based (RGB) imaging, studies have reported several advantages of thermal images over visual ones. 
For example, visual-based methods have difficulty distinguishing between genuine and fake human affective states, especially for people who are skillful in disguising their emotions . Visual-based systems are also sensitive to illumination changes [59–61], and the variety of human skin color, facial shape, texture, ethnic backgrounds, cultural differences, and eyes could influence the affective state recognition rate [59,62]. Visual-based recognition approaches have used posed databases because spontaneous databases are scarce and difficult to construct, which reduces the efficiency of recognizing the spontaneous affective state [63–65]. Moreover, visual-based imaging approaches yield poor recognition accuracy in uncontrolled environments . On the other hand, thermal-based imaging is robust against lighting conditions, which means that it can be used even in completely dark environments. Thermal-based imaging can capture temperature variation influenced by human affects; therefore, it can be adopted to differentiate between spontaneous and fake emotions. For example, studies have shown that the variation of temperature between the left and right sides of the face and a temperature change in the periorbital and nasal facial regions are correlated with human affective state . In addition, the variation of blood flow in the periorbital region provides an opportunity to measure instantaneous and sustained stress conditions . Past studies have also reported challenges in thermal imaging, such as facial occlusion caused by the opacity of glass in individuals who wear eyeglasses, which makes it difficult to read the facial heat pattern. Head pose is another challenge that could influence recognition accuracy in thermal-based affective state recognition. Several studies have proposed solutions to the occlusion and head pose challenges [58,69–72]. 
Past studies have also used thermal-based imaging as a complementary tool to overcome the challenges of visual-based imaging and to enhance affective state recognition accuracy by fusing features from both types of images, such as [69,73–77]. Thermal-based affective state recognition comprises several stages: dataset selection, preprocessing, region of interest (ROI) selection, feature extraction (descriptors), and classification. The sections that follow review these stages and present a brief performance analysis based on a number of works in the literature, as demonstrated in Tab. 1.
2.1 Thermal Dataset
Thermal databases can be divided by affective type into three main categories: posed, induced, and spontaneous databases . In posed databases, participants are asked to act out different types of emotions. Therefore, this type of emotion does not reflect true human affects. In induced databases, participants are exposed to stimuli that elicit their affects, while in spontaneous databases, participants are not aware that their behavior is being recorded. Hence, constructing spontaneous databases is very difficult . More importantly, thermal-based affective state recognition systems require databases that account for different factors that play an important role in recognition accuracy, such as aligned and cropped faces, illumination variation, head poses, and occlusion . Because of the importance of thermal imaging in recognizing human affective state, several methods have been proposed to construct thermal-based databases. Wang et al.  constructed a natural visible and infrared facial expression (NVIE) database. This database comprises both thermal and visible images acquired simultaneously from 215 participants. The database was manually annotated and contains landmarks for spontaneous and posed expressions. In addition, facial images with and without eyeglasses were captured under different illumination conditions. For validation and assessment purposes, the authors applied the PCA, LDA, and AAM algorithms for visible expression recognition and PCA and LDA for infrared expression recognition. Based on the results of the thermal classification algorithms and an analysis of the correlation between facial temperature and emotions, the authors reported that the NVIE database provides suitable features for the thermal affective state recognition process.
Kopaczka et al.  constructed a thermal database with a number of features that facilitate affective state recognition in thermal imaging. First, the database comprises spontaneous and posed expressions. Second, to address the occlusion challenge, it encompasses faces with and without spectacles. Third, the authors captured faces in different positions to allow algorithms to deal with head movement. In addition, the database includes 68 manually annotated facial landmarks and numerous AUs for thermal recognition. Finally, the authors evaluated their database by applying different types of features and classification algorithms, such as HOG, LBP, and SIFT features with SVM, KNN, BDT, LDA, NB, and RF classifiers. Their study reported that the database is suitable for training machine learning algorithms. Nguyen et al.  constructed the KTFE facial expression database. This database focuses on natural and spontaneous expressions in thermal and visible videos. The authors claim that their database solves a number of problems found in other facial expression databases, such as the time interval between stimuli and participants’ responses. More importantly, thermal databases are very few compared with visual databases , and numerous challenges could influence the quality of the data. For example, the number of annotated facial images is small, which limits the use of machine learning algorithms, and the expressions in these databases are mostly posed [92,93]. Consequently, the rarity of thermal databases could limit the validation of thermal-based affective state methods and the comparison between them.
Because of the important role of spontaneous affective state, this paper conducted a brief analysis of the percentage of spontaneous databases compared with posed databases. The analysis relied on the study of Ordun et al. , which collected 17 thermal emotion databases. The percentages of the dataset types in their collection are as follows: spontaneous databases 17.6 %, posed databases 58.8 %, databases with both spontaneous and posed expressions 11.8 %, and databases of unknown type 11.8 %. This analysis supports the claim that collecting spontaneous human affects is challenging. As listed in Tab. 1, 50 % of the studies conducted their own experiment (handcrafted dataset) to collect an induced affective state dataset rather than posed affects. The percentage of studies that used the KTFE and USTC-NVIE datasets is 18.75 %. From this analysis, it can be concluded that past studies tended to use their own handcrafted datasets more than public datasets; the reason could be the rarity of spontaneous and induced databases and the fact that not all databases are publicly available. For example, among the databases listed in Ordun et al.’s  study, 52.9 % require permission from the owners, 11.7 % are publicly available, and 35.2 % provide no information about availability. Another consideration that could help in selecting an appropriate database is the affective state recognition performance obtained with each type of dataset. Therefore, Fig. 1 demonstrates the average recognition accuracy with respect to the dataset type. From Fig. 1, it can be noticed that the average recognition accuracies obtained by using KTFE, datasets developed by the studies themselves (handcrafted), the Ref.  database, and the USTC-NVIE database are 80.65%, 75.98%, 75.40%, and 62.63%, respectively.
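The dataset-type percentages above are shares of a fixed total of 17 databases. As a minimal sketch, the arithmetic can be reproduced as follows; the per-type counts (3, 10, 2, and 2) are an assumption inferred here from the reported percentages and are not stated in the source.

```python
def shares(counts):
    """Return each category's share of the total as a percentage (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Hypothetical counts, inferred from the reported percentages of 17 databases.
dataset_types = {
    "spontaneous": 3,
    "posed": 10,
    "spontaneous+posed": 2,
    "unknown": 2,
}

type_shares = shares(dataset_types)
for name, pct in type_shares.items():
    print(f"{name}: {pct} %")
```

Under these assumed counts the shares match the reported 17.6 %, 58.8 %, 11.8 %, and 11.8 %.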
2.2 Preprocessing Stage
Thermal-based images differ in structure from visual-based images because they encode temperature rather than reflected light. Therefore, the geometric, appearance, and texture characteristics of thermal images call for different preprocessing methods for image enhancement, noise reduction, and facial extraction. In the preprocessing stage, past studies have used several methods to enhance thermal images and to extract the facial region. For example, Kopaczka et al.  applied HOG with SVM for face detection. Merhof et al.  implemented unsharp masking for image enhancement: the study applied a lowpass Gaussian filter and subtracted the filtered image from the original image. Liu and Yin  used a unified model for face detection, which comprises a mixture of trees with a shared pool of parts and is taken from , to locate the facial region in the first frame; they then used the first frame to calculate the head motion. Wang et al.  applied the Otsu thresholding algorithm to generate a binary image. Then, they calculated vertical and horizontal projection curves from the binary image and detected the face boundary from the largest gradient of the projection curves. To improve the image contrast, Latif et al.  implemented contrast limited histogram equalization (CLHE) on thermal images. Mohd et al.  applied the Viola-Jones boosting algorithm with Haar-like features to detect the facial region. The study applied a bilateral filter for noise reduction and facial edge preservation. Wan et al.  proposed a temperature space method to distinguish the facial region from the image background. Trujillo et al.  applied the bi-modal thresholding method to locate facial boundaries. Kolli et al.  proposed color-based detection and region growing with morphological operations for face detection. Goulart et al.  extracted the facial region from the thermal image by using median and Gaussian filters with a binary filter that converts the facial region into pure black and white. Khan et al.  applied a median smoothing filter for blurring and noise reduction and a Sobel filter for edge detection. Cruz-Albarran et al.  proposed a thresholding value to extract the facial region from the image background. Mostafa et al.  proposed an ROI tracking method composed of an adaptive particle filter tracker that tracks the facial ROI. From these studies, it can be noticed that different methods have been proposed for thermal image enhancement and for facial extraction.
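One of the face extraction approaches described above (Otsu thresholding followed by projection curves) can be sketched in a few lines. This is a simplified illustration, not the cited authors' implementation: it assumes the thermal frame has already been scaled to integer gray levels 0 to 255, that the warm face region does not touch the image border, and that the boundary is taken at the largest rise and largest fall of each projection curve. The names `otsu_threshold`, `face_bounds`, and `frame` are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]                 # pixels at or below level t
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # pixels above level t
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def face_bounds(image):
    """Binarize with Otsu, then bound the warm region via projection curves."""
    threshold = otsu_threshold([p for row in image for p in row])
    binary = [[1 if p > threshold else 0 for p in row] for row in image]
    row_curve = [sum(r) for r in binary]          # one value per image row
    col_curve = [sum(c) for c in zip(*binary)]    # one value per image column

    def edges(curve):
        diffs = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
        first = diffs.index(max(diffs)) + 1       # largest rise
        last = diffs.index(min(diffs))            # largest fall
        return first, last

    return edges(row_curve), edges(col_curve)


# Synthetic 6 x 6 frame: cool background (20) with a warm block (200).
frame = [[20] * 6 for _ in range(6)]
for r in range(1, 4):
    for c in range(2, 5):
        frame[r][c] = 200

bounds = face_bounds(frame)   # ((first_row, last_row), (first_col, last_col))
```

On real thermal frames the projection curves are noisy, so smoothing the curves before taking the gradient would likely be needed.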
2.3 Region of Interest (ROI)
Facial temperature is influenced by human affects because the autonomic nervous system (ANS) responds to affects, such as emotions or mental states, by triggering physiological changes, such as increased blood flow inside the vascular system. Hence, the human body temperature is regulated through the propagation of internal body heat to the skin. The variation of facial temperature can be measured by using IR thermal imaging . Another factor that influences facial temperature is the contraction of facial muscles, which also produces heat that propagates to the facial skin . The literature has mainly focused on a number of facial areas to measure human affects, such as the forehead, eyes, mouth, periorbital region, tip of the nose, maxillary region, cheeks, and chin [60,81,82,84,87,100]. Several studies have observed how the temperature in a facial ROI varies with the type of human emotion. For example, the study of Ioannou et al.  reported that the temperature of the nose, forehead, and maxillary region decreases in the stress and fear affective states, and the periorbital temperature increases in the anxiety affective state. Their study demonstrates the temperature variation in facial ROIs as shown in Tab. 2. Moreover, Jian et al.  have shown that the cheeks and eyes are positively correlated with human emotion. Furthermore, Cruz-Albarran et al.  reported that the temperature of the nose and maxillary region decreases in fear, joy, anger, disgust, and sadness, while the temperature of the cheeks increases in disgust and sadness. Also, Ioannou et al.  reported that the temperature of the forehead decreases in fear and sadness and increases in the anger affective state. Tab. 2 demonstrates the temperature changes in facial ROIs within human affects, taken from .
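The ROI-level temperature variation discussed above can be illustrated with a small sketch that compares the mean temperature of each ROI between a baseline frame and a stimulus frame. Everything here is an assumption for illustration: the frames are tiny synthetic grids of temperatures in degrees Celsius, and the ROI names and rectangle coordinates are hypothetical rather than taken from any cited study.

```python
from statistics import mean


def roi_mean(frame, roi):
    """Mean temperature inside a rectangle (top, left, bottom, right), inclusive."""
    top, left, bottom, right = roi
    return mean(
        frame[r][c]
        for r in range(top, bottom + 1)
        for c in range(left, right + 1)
    )


def roi_changes(baseline, stimulus, rois):
    """Per-ROI change in mean temperature (stimulus minus baseline)."""
    return {
        name: round(roi_mean(stimulus, box) - roi_mean(baseline, box), 2)
        for name, box in rois.items()
    }


# Hypothetical ROI rectangles on a 4 x 4 grid.
rois = {"forehead": (0, 0, 0, 3), "nose": (2, 1, 3, 2)}

baseline = [[34.0] * 4 for _ in range(4)]
stimulus = [row[:] for row in baseline]
for r in range(2, 4):
    for c in range(1, 3):
        stimulus[r][c] = 33.5     # simulated cooling of the nose region

changes = roi_changes(baseline, stimulus, rois)
```

A negative value indicates cooling of that ROI relative to baseline, which is the kind of directional change summarized in Tab. 2.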
2.4 Feature Extraction (Descriptors)
Features play an important role in the recognition of human affects, especially in thermal images, which have a different texture, appearance, and shape than RGB images. Past studies have used several types of thermal descriptors in their works. The performance of affective state recognition relies on the efficiency of the facial features extracted from either the whole face or facial parts (ROIs). Tab. 1 in this survey lists numerous types of feature extraction. From Tab. 1, it can be noticed that more than 50% of the studies have used statistical features. For example, Cross et al.  selected the mean value of temperature points, while Pérez-Rosas et al.  used the average temperature, the overall minimum and maximum temperature, the standard deviation, and the standard deviation of the minimum and maximum temperature. Nguyen et al. [69,88] applied the covariance matrix over facial temperature points. Liu and Yin  applied the histogram of SIFT features with the variation of facial temperature. Wang et al.  selected a number of statistical features in addition to 2D-DCT, GLCM, and DBM features. Basu et al.  applied moment invariants with histogram statistic features, while Goulart et al.  applied numerous statistical features, such as mean emissivity, variance of emissivity, mean value of rows and columns, median of rows and columns, and another seven statistical features. Similarly, the study of Khan et al.  applied a number of statistical features, such as the mean of thermal intensity values, covariance, and eigenvectors. Another important feature that has received attention from researchers is the gray level co-occurrence matrix (GLCM) [79,85,87]. GLCM describes the texture information in thermal images through co-occurrence distributions, which represent the spatial relationship between pixel pairs at a given distance and angle over the image. GLCM can be computed along multiple directions, such as the horizontal direction (0°), the vertical direction (90°), and the diagonal directions (45°) and (135°). 
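To make the co-occurrence idea concrete, the following is a minimal sketch of a GLCM computed for a single pixel offset, together with the contrast statistic that is commonly derived from it. It is an illustration under stated assumptions, not the implementation used in the cited studies: the image is a small grid of already-quantized gray levels, the matrix is non-symmetric and unnormalized, and the mapping from angles to offsets in `OFFSETS` assumes a distance of one pixel with rows numbered from the top.

```python
def glcm(image, dr, dc, levels):
    """Count gray-level pairs (i, j) for pixels separated by the offset (dr, dc)."""
    matrix = [[0] * levels for _ in range(levels)]
    n_rows, n_cols = len(image), len(image[0])
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """Contrast: sum of (i - j)^2 weighted by the normalized co-occurrence counts."""
    total = sum(map(sum, matrix))
    return sum(
        (i - j) ** 2 * count / total
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )


# Assumed (row, column) offsets for distance 1 at 0, 45, 90, and 135 degrees.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

# Toy 4 x 4 image with four gray levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

horizontal = glcm(image, *OFFSETS[0], levels=4)
texture_contrast = contrast(horizontal)
```

In practice the matrices for the four directions are often averaged, or their statistics concatenated, to form the feature vector.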
Histogram of oriented gradients (HOG) is another facial descriptor used in thermal-based affective state recognition . The HOG descriptor counts the occurrences of gradient orientations in local regions of an image by dividing the image into a number of sub-regions called cells and computing the histogram of gradient directions for each cell. Local Binary Pattern (LBP) has also been applied in thermal-based affective state recognition . This descriptor thresholds the neighborhood of each pixel by using a 3 × 3 window, where the center of the window is the pixel itself and the remaining cells are its neighbors, and the thresholded values are read as a binary number. The LBP operator has been extended to use circular neighborhoods so that features can be captured at different scales . Different kinds of LBP have been established, for example, LBP for the spatial domain, LBP for the spatiotemporal domain, and LBP for face description. Scale-invariant feature transform (SIFT) and Dense SIFT have also been selected as facial descriptors in thermal-based affective state recognition [80,84]. The general idea of the SIFT descriptor is to describe local features in an image, which are then compared based on the Euclidean distance. Principal Component Analysis (PCA) is a statistical approach that past studies have selected for dimensionality reduction [86,90]; it transforms a large set of possibly correlated variables into a small set of uncorrelated variables (principal components) while preserving most of the information in the large set. In the facial expression recognition domain, PCA extracts global facial features from a set of images as variations from the mean face (eigenfaces). Several other feature descriptors have also been utilized by past studies. For example, 2D-DCT was used with other descriptors in , 2DWT was applied by , and the fuzzy color and texture histogram (FCTH) was used in . This paper analyzed the performance of facial descriptors as follows. The first analysis concerns the maximum recognition accuracy reported in Tab. 1, which is 98.6%, obtained by combining 2DWT, GLCM, and LBP features as reported in . The next analysis concerns the feature type most frequently selected by past studies. As shown in Tab. 1, about 50% of past studies used statistical features and reported an overall average recognition accuracy of 75.91%. The preference for statistical features could be due to the nature of thermal imaging, which motivates studies to measure the variation of facial skin temperature, whereas RGB images offer better facial texture and shape.
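The basic 3 × 3 LBP operator described above can be sketched as follows. This is a minimal illustration: the bit ordering (clockwise from the top-left neighbor) and the use of "greater than or equal to the center" as the threshold rule are assumptions, since conventions differ between implementations, and the function names are hypothetical.

```python
# Neighbor positions visited clockwise, starting at the top-left corner.
NEIGHBOR_ORDER = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def lbp_code(patch):
    """Return the 8-bit LBP code of a 3 x 3 patch."""
    center = patch[1][1]
    code = 0
    for bit, (r, c) in enumerate(NEIGHBOR_ORDER):
        if patch[r][c] >= center:      # neighbor at least as warm/bright as center
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels of a 2D image."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch)          # bits 0, 4, 5, 6, 7 are set
histogram = lbp_histogram(patch)
```

The 256-bin histogram, computed per face or per ROI, is what is typically passed to the classifier as the texture feature.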
2.5 Classification
A robust classification method is crucial in thermal-based affective state recognition in order to maintain high accuracy of affective prediction. Consequently, several classifiers have been used for this purpose. As demonstrated in Tab. 1, the SVM classifier has been widely used in past studies [58,61,79,80,84–87]. The second classifier that past studies have focused on is LDA [81,89]. In addition, KNN is another important classifier [82,102]. PCA, EMC, and PCA-EMC have been used by [60,88,90]. AdaBoost was also selected by . The deep Boltzmann machine (DBM) was used in . The Modified Hausdorff distance (MHD) was applied in . A complex tree classifier was implemented in . This paper analyzed the performance of numerous classification algorithms. The maximum accuracy is 99.1%, reported in ; the classifier used in that study was weighted KNN. Moreover, the average accuracy reported across all the reviewed studies is 76.69%. Fig. 2 shows how frequently each classifier was used together with its average recognition accuracy. From this figure, it can be noticed that the classifier most frequently used in past studies is SVM, which accounts for 28% of use and reported the highest average recognition accuracy, 83.05%.
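Since weighted KNN is the classifier behind the highest accuracy mentioned above, a minimal sketch of the idea may be useful. The source does not state which weighting scheme was used, so inverse-distance weighting is an assumption here, as are the two-dimensional feature vectors (hypothetical ROI mean temperatures in degrees Celsius) and the labels.

```python
from collections import defaultdict
from math import dist


def weighted_knn(train, query, k=3):
    """Predict a label by inverse-distance-weighted voting among the k nearest samples."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        # A small constant avoids division by zero for exact matches.
        votes[label] += 1.0 / (dist(features, query) + 1e-9)
    return max(votes, key=votes.get)


# Hypothetical training samples: (forehead mean, nose mean) -> label.
samples = [
    ((34.1, 33.0), "neutral"),
    ((34.0, 33.2), "neutral"),
    ((33.2, 31.9), "stress"),
    ((33.0, 32.0), "stress"),
    ((33.1, 31.8), "stress"),
]

label = weighted_knn(samples, (33.9, 33.1), k=3)
```

Unlike plain majority voting, a single very close neighbor can outweigh several distant ones, which matters when classes are imbalanced within the neighborhood.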
3 Conclusion
This paper discusses the importance of thermal-based imaging in affective state recognition. Thermal images have important characteristics that could help researchers overcome well-known challenges of other techniques, such as visual-based and physiological-based affective state recognition. For example, thermal images are robust against illumination changes, and they capture transient facial temperature, which reflects inner human affects. Moreover, past studies have used thermal images to measure important vital signals, such as pulse rate and respiration rate, in a contactless and noninvasive way. Therefore, researchers have recently turned their attention to thermal-based imaging as a potential and supplemental solution to the challenges of affective state recognition. This paper also discusses the general stages of affective state recognition based on thermal imaging and presents a brief performance analysis. According to this analysis, 50% of past studies collected their own dataset, possibly because spontaneous and induced databases are rare, and 52.9% of the listed databases require permission to use. The analysis also shows that 50% of past studies applied statistical features, with an overall average recognition accuracy of 75.91%. In the classification stage, SVM is the classifier most frequently used by past studies, accounting for 28% of use with an overall average recognition accuracy of 83.05%.
Funding Statement: This research was funded by the research university grant by Universiti Sains Malaysia [1001.PKOMP.8014001].
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.