Prediction of Changed Faces with HSCNN

: Convolutional Neural Networks (CNN) have been successfully employed in the field of image classification. However, CNN trained using images from several years ago may be unable to identify how such images have changed over time. Cross-age face recognition is, therefore, a substantial challenge. Several efforts have been made to resolve facial changes over time utilizing recurrent neural networks (RNN) with CNN. The structure of RNN contains hidden contextual information in a hidden state to transfer a state in the previous step to the next step. This paper proposes a novel model called Hidden State-CNN (HSCNN). This adds to CNN a convolution layer of the hidden state saved as a parameter in the previous step and requires no more computing resources than CNN. The previous CNN-RNN models perform CNN and RNN, separately and then merge the results. Therefore, their systems consume twice the memory resources and CPU time, compared with the HSCNN system, which works the same as CNN only. HSCNN consists of 3 types of models. All models load hidden state h t − 1 from parameters of the previous step and save h t as a parameter for the next step. In addition, model-B adds h t − 1 to x, which is the previous output. The summation of h t − 1 and x is multiplied by weight W. In model-C the convolution layer has two weights: W 1 and W 2 . Training HSCNN with faces of the previous step is for testing faces of the next step in the experiment. That is, HSCNN trained with past facial data is then used to verify future data. It has been found to exhibit 10 percent greater accuracy than traditional CNN with a celeb face database.


Introduction
Face recognition (FR) systems have been continually developed for personal authentication. These efforts have resulted in FR applications acting on mobile phones [1]. Researchers have proposed several ideas for FR systems: eigen faces [2], independent component analysis [3], linear discriminant analysis [4,5], three-dimensional (3D) method [6][7][8][9], and liveness detection schemes to prevent the misuse of photographic images [10]. Using data acquisition methodology, Jafri et al. [11] divided FR techniques into three categories: intensity images, video sequences, and 3D or infra-red techniques. They introduced AI approaches as one of the operating methods for intensity images and reported that it worked efficiently for somewhat complex FR scenarios. Such techniques had not previously been utilized for practical everyday purposes.
In 2012, AlexNet [12] was proposed and became a turning point in large-scale image recognition. It was the first CNN, one of the deep learning techniques, and was declared the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 with 83.6% accuracy. In ILSVRC 2013, Clarifai was the winner with 88.3% [13,14], whereas in ILSVRC 2014, GoogLeNet was the winner with 93.3% [15]. The latter was an astonishing result because humans trained for annotator comparison exhibited approximately 95% accuracy in ILSVRC [16]. In 2014, using a nine-layer CNN, DeepFace [17] achieved 97.35% accuracy in FR, closely approaching the 97.53% ability of humans to recognize cropped Labeled Faces in the Wild (LFW) benchmark [18]. However, DeepID2 [19] achieved 99.15% face verification accuracy with the balanced identification and verification features on ConvNet, which contained four convolution layers. In 2015, DeepID3 [20] achieved 99.53% accuracy using VGGNet (Visual Geometry Group Net) [21], whereas FaceNet [22] achieved 99.63% using only 128-bytes per face.
CNN consists of convolution layers, pooling layers, and fully connected layers. However, a number of problems still needed to be addressed. For instance, CNN trained with past images failed to verify changed images according to a time sequence. In their in-depth FR survey, Wang et al. [23] described three types of cross-factor FR algorithms as challenges to address in real-world applications: crosspose, cross-age, and makeup. Cross-age FR is a substantial challenge with respect to facial aging over time. Several researchers have attempted to resolve this issue. For instance, Liu et al. [24] proposed an age estimation system for faces with CNN. Bianco et al. [25] and Khiyari et al. [26] applied CNN to learn cross-age information. Li et al. [27] suggested metric learning in a deep CNN. Other studies have suggested combining CNN with recurrent neural networks (RNN) to verify changed images because RNN can predict data sequences [28]. RNN contains contextual information in a hidden state to transfer a state in the previous step to the next step, and has been found to generate sequences in various domains, including text [29], motion capture data [30], and music [31,32].
This paper proposes a novel model called Hidden State-CNN (HSCNN) as well as training the modified CNN with past data to verify future data. HSCNN adds to CNN a convolution layer of the hidden state saved as a parameter. The contributions of the present study are as follows: First, the proposed model, HSCNN, exhibits 10 percent greater accuracy than traditional CNN with a celeb face database [33]. Facial images of the future were tested after training based on facial images of the past. HSCNN adds the hidden state saved as a parameter in the previous step to the CNN structure. Further details on this process are provided in Section 4.2.
Secondly, because HSCNN included the hidden state of RNN in the proposed architecture, it was efficient in its use of computing resources. Other researchers have performed CNN and RNN, separately, and merged the results in the system, consuming double the number of resources and processing time. Further details are presented in Section 2.
Thirdly, this paper indicates that HSCNN can train with only two images of one person per step. Also, the loss value reached 0.4 in just 40 epochs in training with loading parameters and 250 epochs in training without loading parameters. Therefore, the HSCNN achieves efficiency because it uses only two images and trains in 40 epochs with loading parameters. This is explained further in Section 4.1.
In the remainder of this paper, Section 2 introduces related works, Section 3 outlines the proposed method, Section 4 presents the experimental results, and Section 5 provides the conclusion.

Related Works
Some neural network models can acquire contextual information in various text environments using recurrent layers. Convolutional Recurrent Neural Networks (CRNN) works for a scene text recognition system to read scene texts in the image [34]. It contains both convolutional and LSTM recurrent layers of the network architecture and uses past state and current input to predict the subsequent text. Recurrent Convolutional Neural Networks (RCNN) also uses a recurrent structure to classify text from document datasets [35]. The combined CNN and RNN model uses relations between phrases and word sequences [36] and in the field of natural language processing (NLP) [37].
Regarding changed images, combining CNN with RNN methods have been proposed in image classification [38] and a medical paper for blood cell images [39]. These authors merged the features extracted from CNN and RNN to determine the long-term dependency and continuity relationship. A CNN-LSTM algorithm was proposed for stock price prediction according to leading indicators [40]. This algorithm employed a sequence array of historical data as the input image of the CNN and feature vectors extracted from the CNN as the input vector of LSTM. However, these methods used CNN and RNN, separately, and merged the results vectors extracted from CNN to RNN. Therefore, their systems consume twice the memory resources and CPU time, compared with the proposed system, which works the same as CNN only. Fig. 1 presents an overview of the models developed by Yin et al. [38] and Liang et al. [39].

Proposed Method: Hidden State CNN
The following subsections explain the equation of the loss function cross-entropy error and optimizer of the stochastic gradient descent used in the proposed method. And then, 3 types of models of Hidden State CNN are described.

Loss Function and Optimizer
In HSCNN, experiments indicated that cross-entropy error (CEE) was the appropriate loss function (cost function or objective function). The best model has a cross-entropy error of 0. The smaller the CEE, the better the model. The CEE equation is: where t k is the truth label, and y k is the softmax probability for the k th class. When calculating the log, a minimal value near the 0.0 delta is necessary to prevent a '0' error. In the python code, the CEE equation is: Also, the optimizer is the stochastic gradient descent (SGD), which is a method for optimizing a loss function. The equation is: where W is the weight, η is the learning rate, and ∂L ∂W is the gradient of the loss function L for W.

Hidden State CNN (HSCNN)
HSCNN consists of 3 types of models: model-A, model-B, and model-C. Model-A is the same as I-CNN. Fig. 2 presents the overall structure of HSCNN, which is the modified AlexNet with hidden state h t of Convolution layer 5. HSCNN consists of convolution layers, RelU function, pooling layers, affine layers, and soft max function such as traditional CNN, but it also has hidden states: h t−1 and h t . This distinguishes HSCNN from CNN.  Fig. 4 presents the structure of model-B. Model-B loads hidden state h t−1 from parameters of the previous step and saves h t as a parameter for the next step. The model adds h t−1 to x, which was the previous output. The summation of h t−1 and x is multiplied by weight W. The activation function of the rectified linear unit (RelU) equation of model-B is:

Experimental Results
In the experiment, HSCNN used Cross-Entropy Error as the loss function, Stochastic Gradient Descent (SGD) as the optimizer, and the Instruction rate (Ir) was 0.001. HSCNN also used python, NumPy library, and CuPy library for the NVIDIA GPUs employed for software coding.

Preparations for Experiments
The essential items required for the experiments are listed in Tab. 1. The experiments used FaceApp Pro to make old faces from young faces and Abrosoft Fanta Morph Deluxe to generate 100 morphing facial images between young and old faces. The facial images of 1,100 persons were selected from the celeb face database, known as the Largescale CelebFaces Attributes (CelebA) Dataset. Those faces were then paired with a similar face; for example, man to man, woman to woman, Asian to Asian, and so on. One was chosen as the young face between paired faces and the other was changed to the old face. To make these changes, the photo editing application FaceApp Pro was used. Between the transition from young faces to old faces, 100 continually changing faces were created using the photo morphing software FantaMorph.
The difference between the young and old images of the same person made by FaceApp Pro was not sufficient, so the experiments also paired the faces of two other persons. Thus, the young face and old face of each pair were different persons. Finally, 102 images changing from young to old were created for all 550 paired images. Number 1 is the youngest image, and number 102 is the oldest. 102 images show the aging state over time from young to old. Number 1 is the youngest image, and number 102 is the oldest. 102 images show the aging state over time from young to old. Numbers 1 to 10 are young images used as the first training data, and the last numbers 101 to 102 are old images and used as final target images. Using 90 images from numbers 11 to 100, the experiments used 10 steps to learn and test aging changes over time. When learning using 9 images in each of 10 steps, all 90 images are used. However, in this case, learning occurs on the entire data, and the training result of the proposed model or CNN was almost the same. So the prediction experiments become meaningless. Therefore, to clearly confirm the results of the prediction experiment, two images are trained for each step, and the two images mean a specific point in the middle of facial aging. This is the same as testing the face at the last oldest point with two final target images. Tab. 2 presents sample images from the dataset and their numbers for each step: primary step, target step, and step 1 to step 10. Beyond primary step, each step has only two samples. There is a substantial difference between images number 60 to number 90; however, after number 90, all facial images look almost identical. HSCNN appears to be an extremely efficient method as it achieved 99.9% accuracy and 0.01 loss value with only two training images in each step from step 1 to step 10. Also, the HSCNN is trained with loading parameters saved in the previous step. The use of loading parameters means the training epochs can be shortened. Fig. 6 indicates that the loss value reaches 0.4 in 250 epochs in training without loading parameters and just 40 epochs with loading parameters. Thus, the HSCNN achieves efficiency because it uses only two images and trains in 40 epochs with loading parameters.

Experiments
The experiments utilized AlexNet as the traditional CNN and created three models of HSCNN modifying AlexNet. All models added hidden states to the convolution five-layer and used the RelU 5 layer of the AlexNet.  The experiments consisted of 2 groups: experiment I, experiment II. Experiment I tested the target step, and experiment II tested step 3 (behind the training step).
Step 1 employed the parameters saved in the primary step, while step 2 used the parameters saved in step 1.
Tab. 3 presents the accuracy of the target step in testing faces. Tab. 4 presents the accuracy of step 3 behind, according to the models, in testing faces. The test of step 3 behind means that HSCNN tested step 4 when training step 1, and step 5 when training step 2.   Fig. 8 depicts the differences between model-A and AlexNet. These indicate that model-A achieved more than 10% higher accuracy in step 3 to step 5 of experiment I and in step 1 to step 3 of experiment II. Therefore, model-A can be employed for both long-and short-time changes. Fig. 9 depicts the differences between model-B and AlexNet. These indicate that model-B achieved more than 10% higher accuracy in step 4 to step 6 of experiment I and in step 1 to step 2 of experiment II. Therefore, model-B can be employed to predict both long-and short-time changes. Fig. 10 depicts the differences between model-C and AlexNet. These indicate that model-C achieved more than 10% higher accuracy in step 2 to step 3 of experiment II, but not in any step of experiment I. Therefore, model-C is considered an appropriate method to predict short-time changes.   This paper proposes a novel model, Hidden State-CNN (HSCNN), which adds to CNN a convolution layer of the hidden state. Training with past data to verify future data was the aim of the experiments, and HSCNN exhibited 10 percent higher accuracy with the celeb face database than the traditional CNN. The experiments consisted of 2 groups. Experiment I tested the target step and experiment II tested step 3 behind the training step. The latter step means that step 4 was tested when training step 1. Experiment I assessed long-time changes, and experiment II relatively shorttime changes. Both yielded similar results. Furthermore, HSCNN achieved a more efficient use of computing resources because its structure differed from that of other CNN-RNN methods. HSCNN also represented a new and efficient training process to verify changing faces. It used only two training images in each step and achieved 99.9% accuracy and 0.01 loss value.

Conflicts of Interest:
The author declares that he has no conflicts of interest to report regarding the present study.