Local-Tetra-Patterns for Face Recognition Encoded on Spatial Pyramid Matching

Face recognition remains a challenging research problem because of misalignment, illumination changes, pose variations, occlusion, and expressions. Providing a single solution that handles all of these problems at once is difficult. We address these issues with a face recognition model based on local tetra patterns and spatial pyramid matching. The input image is passed through an algorithm that extracts local features using spatial pyramid matching and max-pooling. Finally, the input image is recognized from the extracted features using a robust kernel representation method. Qualitative and quantitative analysis of the proposed method is carried out on benchmark image datasets. Experimental results show that the proposed method performs better than state-of-the-art methods in terms of standard performance evaluation parameters on the AR, ORL, LFW, and FERET face recognition datasets.


Introduction
Face recognition has long been an active topic in computer vision research. Researchers have introduced many techniques to increase recognition accuracy while reducing time and processing cost [1]. In recent years in particular, facial recognition and other biometric verification systems have developed greatly. In the current era, relying on a single authorization method for protection is out of the question [2]. Since cyber threats and attacks are also getting stronger, security can be improved by introducing multi-factor authorization. Facial recognition is a very promising option for this purpose.
The deep-learning-based approach for face recognition is a relatively new but fast-growing area [3]. Deep learning models can extract features themselves and learn to classify based on the extracted features. In recent years, several advancements have been made in deep learning techniques for face recognition. Although a tremendous amount of research has been done on deep learning and accuracy has increased considerably in recent years, there are still some problems in implementing these approaches [4]. Firstly, high computational power and long training times are required [5]. Moreover, a huge dataset is required to achieve significant results; if a small to medium dataset is used, there is a high chance of overfitting. Furthermore, the complexity of developing deep learning models is high. These approaches are also reported to face problems when the number of classes increases [6]. Many recent computer-vision-based studies [7,8] report face recognition techniques that remain proficient in challenging situations such as covered faces and pose variations. However, the performance of these methods can be improved further.
In most existing computer-vision-based approaches, face recognition is performed in three steps: pre-processing, feature extraction, and classification [9]. The preprocessing step involves face detection, alignment, rotation, correction, scaling, and noise removal. The feature extraction step is concerned with extracting discriminatory facial features that play an important role in the unique identification of faces. Finally, the extracted features are classified against the features of the database images. Yet it is more interesting to perform robust face recognition without any preprocessing. Our study aims to improve the technique proposed by Yang et al. [7], in which Local Binary Pattern (LBP) features were followed by multi-partition max-pooling and Robust Kernel Representation (RKR).
We propose an efficient feature extraction technique using Spatial Pyramid Matching (SPM) and second-order Local Tetra Pattern (LTrP) features. The method handles the aforementioned problems, such as pose variation, illumination changes, and covered faces, without extra preprocessing techniques. In our proposed methodology, a pattern map of the input image is first created using LTrP. The pattern map is then partitioned using SPM, and histograms of the resulting blocks are created. Afterward, max-pooling is executed on the obtained histograms to create the final feature vector. The last step is classification based on the obtained feature vector. Experimental results on four benchmark datasets show the superiority of our approach over other state-of-the-art techniques. Our research contributes to the face recognition literature by presenting a technique that combines SPM and second-order LTrP. The combination of these techniques provides better results than various existing computer-vision-based face recognition techniques.
The rest of the paper is organized as follows: Section 2 outlines a review of the latest techniques relevant to the proposed approach. In Section 3, the detail of the proposed methodology is presented. Experimental results are elucidated in Section 4. Lastly, Section 5 concludes the proposed method and presents future directions.

Related Work
Face recognition is a vast research field to which researchers have devoted considerable effort, producing many high-impact techniques. In this section, we review existing techniques for feature extraction, dimension reduction, and classification. LBP is a local face descriptor that is widely used in the feature extraction phase for images [10]. The concept of LBP was first introduced by Ojala, where a 3 × 3 neighborhood of image pixels is considered and the center pixel serves as a threshold against which each neighboring pixel is compared [11]. In general, LBP is calculated by comparing the surrounding pixels with the reference pixel and encoding the result of each comparison as a bit, 1 or 0. A binary pattern is obtained by setting the bit to 1 for each neighbor whose value is greater than or equal to the center and to 0 for the rest. Then, a binary code is formed by concatenating the bits clockwise, and the center pixel is assigned the decimal equivalent of the resulting binary number. The final LBP feature can be obtained by collecting all the binary or decimal codes in a resultant matrix.
Several variants of LBP have been proposed by researchers. Truong et al. [12] proposed a variant of LBP called 'weighted statistical binary pattern by direction.' In this technique, the descriptors use straight-line topologies along different directions. By dividing the input image into mean and variance moments, and subsequently computing weighted histograms of the sign and amplitude components, a robust facial feature representation was achieved. Local Derivative Pattern (LDP) is a variant of LBP in which directional pattern features are encoded based on variations in the local derivative. It was first employed by [13] for the problem of face recognition. It works by extracting higher-order local information through the encoding of individual spatial relationships within the local region. After detailed experimentation, [14] concluded that LDP was robust under all noisy conditions and performed better under different illumination conditions and rotation angles than LBP and six of its other variants.
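The basic 3 × 3 LBP operator described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited works: the function name `lbp_3x3`, the clockwise starting point (top-left), and the choice to skip border pixels are all assumptions made for the example.

```python
import numpy as np

# Clockwise neighbour offsets (row, col), starting at the top-left pixel.
# The starting point is an assumption; any fixed order gives a valid code.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_3x3(image):
    """Return the LBP code map for the interior pixels of a 2-D grayscale image."""
    img = np.asarray(image, dtype=np.int32)
    h, w = img.shape
    center = img[1:h - 1, 1:w - 1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for dy, dx in OFFSETS:
        # Shifted view holding, for every interior pixel, one of its neighbours.
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Append one bit: 1 where the neighbour is >= the centre pixel.
        codes = (codes << 1) | (neighbour >= center).astype(np.uint8)
    return codes
```

For the patch `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, the clockwise bits are 0, 0, 0, 1, 1, 1, 1, 0, so the center pixel receives the code 30. A histogram of the code map is what is typically used as the feature.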
Local Ternary Patterns (LTP) were developed by Tan et al. [15]. LTP is an extended form of LBP in which the two-valued code is extended to a three-valued code and the sign function s(x) is replaced by a three-valued function. The resulting code contains three types of values: positive, negative, and zero. The ternary code is typically read clockwise starting from the top-left cell. As an example, it can form a code like 01(-1)011(-1)0. This code is then split into two binary patterns, one for the positive values and one for the negative values. After this, the individual histograms of both patterns are calculated, and the final feature descriptor is obtained by concatenating these histograms.
The higher-order LTrP was first introduced by Murala et al. [16]. It builds on the ideas of other local patterns, including LBP, LDP [13], and LTP [15]. LTrP is generally used for texture analysis because of its improved ability to extract definite information from an image. It is calculated from the pixels surrounding the reference pixel by taking their first-order derivatives in the horizontal and vertical directions. LTrP is very effective for content-based applications and is mostly used in Content-Based Image Retrieval (CBIR). It was found that the results obtained by second-order LTrP are better than those of higher-order LTrP because the sensitivity to noise increases as the order increases [17].
Mehmood et al. [18] presented a novel image representation technique based on the Bag of Visual Words (BoVW) model in which an image is partitioned into two rectangular regions and the histograms of those regions are calculated. During the construction of the histograms, this spatial information is stored in the BoVW model. This representation of an image can be used as a feature in image classification and therefore for face recognition. For dimension reduction, pooling techniques are used, such as sum pooling [19,20] and max-pooling [20][21][22].
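Returning to the LTP coding described earlier in this section, the sketch below shows how a ternary code is formed and then split into its positive and negative binary patterns. It is a hedged illustration only: the helper names `ltp_code` and `split_ltp` are hypothetical, and the tolerance `t = 5` is an arbitrary value chosen for the example.

```python
import numpy as np


def ltp_code(neighbours, center, t):
    """Ternary code per neighbour: +1 if n >= c + t, -1 if n <= c - t, else 0."""
    n = np.asarray(neighbours, dtype=np.int32)
    code = np.zeros(n.shape, dtype=np.int8)
    code[n >= center + t] = 1
    code[n <= center - t] = -1
    return code


def split_ltp(code):
    """Split a ternary code into its positive and negative binary patterns."""
    code = np.asarray(code)
    upper = (code == 1).astype(np.uint8)   # keeps the +1 entries
    lower = (code == -1).astype(np.uint8)  # keeps the -1 entries
    return upper, lower


# Hypothetical neighbour values around a centre of 50, tolerance t = 5.
code = ltp_code([52, 60, 40, 48, 55, 70, 45, 50], center=50, t=5)
upper, lower = split_ltp(code)
```

With these values the ternary code is 0, 1, -1, 0, 1, 1, -1, 0 (the example code from the text), which splits into the positive pattern 01001100 and the negative pattern 00100010. Histograms of the two pattern maps would then be concatenated.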
Spatial pooling has been widely adopted in image classification to extract invariant features while reducing dimensionality and processing cost. In face recognition, various researchers adopt pooling techniques as a way to enhance feature encoding [23,24]. The most commonly used pooling methods are sum pooling, average pooling, and max-pooling. Max-pooling preserves the maximum response, whereas average pooling preserves the average response. Many researchers have shown the superiority of max-pooling over other pooling techniques for the face recognition problem. This advantage of max-pooling is probably due to its robustness to local spatial variations [22,25]. The performance of each pooling method depends heavily on the block division. In this respect, a variable-sized multi-partitioning scheme is an effective approach [7].
Various classification techniques have been proposed by researchers as the final step of face recognition. The three most popular approaches are Collaborative Representation-based Classification (CRC), Sparse Representation-based Classification (SRC), and kernel representation-based techniques. Both SRC and CRC are termed dictionary learning methods because the test samples are represented using a dictionary. In SRC, the test image is coded sparsely over the training images so that only a few relevant dictionary elements (atoms) are used and the remaining, usually unimportant, atoms are discarded. The chosen elements then play an important role in discriminating the test image among the training images [26]. In CRC-based techniques, the test image is coded collaboratively over the complete training set [27]. Classification is then done by assigning the query image to the class with the minimal distance. The representation coefficients in this technique are generally obtained through ℓ2 regularization. Wang et al. [28] presented a comprehensive review of facial feature extraction methods in which local features and discriminative representation were brought together. The review highlighted several common practices and motivations in facial recognition frameworks, with illustrative examples. Song et al. [8] used block-weighted LBP along with CRC for face recognition and achieved better results than the simple CRC technique and variants of the SRC technique. Ma et al. [29] proposed a robust face classification technique using a weighted spatial pyramid structure. The traditional spatial pyramid structure treats each partition of the image equally, which is less effective. Their model assigns a weight to each partition using a self-adaptive method. The algorithm is robust against misalignments, pose variations, and expressions.
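To make the CRC idea above concrete, the following sketch codes a query over the whole training set with ℓ2 (ridge) regularization and assigns it to the class with the smallest normalized reconstruction residual. It is an illustration under stated assumptions, not the method of [27] or [8] as published: the function name `crc_classify`, the regularization value `lam`, and the residual normalization by the coefficient norm are choices made for this example.

```python
import numpy as np


def crc_classify(D, labels, y, lam=0.01):
    """Classify y by collaborative representation over dictionary D.

    D      : (d, n) array whose columns are training feature vectors
    labels : (n,) class label of each column
    y      : (d,) query feature vector
    lam    : ridge (l2) regularization weight, an assumed value
    """
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    n = D.shape[1]
    # Closed-form ridge solution: alpha = (D^T D + lam I)^-1 D^T y.
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)
    best_label, best_score = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        # Reconstruct y using only the atoms and coefficients of class c.
        residual = np.linalg.norm(y - D[:, mask] @ alpha[mask])
        score = residual / (np.linalg.norm(alpha[mask]) + 1e-12)
        if score < best_score:
            best_label, best_score = c, score
    return best_label
```

Because the coefficients come from a single linear solve, the coding step is cheap compared with the iterative ℓ1 optimization that sparse coding requires.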
The spatial pyramid matching technique is not robust against rotated images. In other words, it is not rotation invariant. Karmakar et al. [30] proposed a modified spatial pyramid matching technique that is robust against rotated images. The model uses a weight function that plays an important role in making the technique rotation invariant. A weighted spatial pyramid technique was proposed by Choi [31], based on the division and subdivision of the image into multiple finer-grained partitions on each level of the pyramid. On each pyramid level, the calculated sum of each partition in the partition set is then used for recognition. The weights of the spatial pyramid are determined on each pyramid level using the discriminative power of the feature class. Kernel representation-based classifiers provide a better way to classify features that are not linearly separable. They map the features into a high-dimensional feature space, where linear classifiers can classify them more effectively. Yang et al. [7] presented an RKR method that performed robust face recognition using statistical local features. In that technique, multi-partition max-pooling was used to extract the local features, and the RKR method was used to exploit the discriminative information stored in the local features as fully as possible. Occluded faces were handled by robust regression.

Proposed Methodology
The face recognition process in our methodology is divided into two phases. The first phase is feature extraction and formation. Initially, the input image undergoes the LTrP operation. Afterward, SPM is used to divide the resulting pattern map into partitions at several scales. Finally, the histogram of each block is determined and max-pooling is applied over the histogram features to produce the final feature vector. In the second phase, an RKR-based classifier is employed for feature classification. This classification step identifies the class of the input image based on the extracted features. Results are reported as the percentage of correctly identified test images, which is known as the recognition accuracy. The individual sub-steps of the proposed methodology are discussed in the following subsections. Fig. 1 gives an overview of the face recognition process.

Local Tetra Patterns Formulation
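This is the first step of the feature extraction phase, in which a pattern map of the LTrP of the input image is calculated. The mathematical formulation is described below in detail. LTrP builds on the earlier local patterns LBP [32], LDP [33], and LTP [34]. LTrP captures a well-formed spatial structure of basic patterns taken from either textures or face data (as in our case). This spatial structure is derived from the direction of the central pixel. Consider an image I; the first-order derivatives along the 0° and 90° directions are denoted as $I^{1}_{\theta}(g_p)|_{\theta = 0^{\circ}, 90^{\circ}}$. Let $g_c$ denote a central pixel in image I, and let $g_v$ and $g_h$ denote the neighboring pixels in the vertical and horizontal positions relative to $g_c$, respectively. The first-order derivatives at the central pixel $g_c$ can be expressed as:

$$I^{1}_{0^{\circ}}(g_c) = I(g_h) - I(g_c) \tag{1}$$

$$I^{1}_{90^{\circ}}(g_c) = I(g_v) - I(g_c) \tag{2}$$

The direction of the central pixel can be calculated by:

$$I^{1}_{Dir}(g_c) = \begin{cases} 1, & I^{1}_{0^{\circ}}(g_c) \geq 0 \text{ and } I^{1}_{90^{\circ}}(g_c) \geq 0 \\ 2, & I^{1}_{0^{\circ}}(g_c) < 0 \text{ and } I^{1}_{90^{\circ}}(g_c) \geq 0 \\ 3, & I^{1}_{0^{\circ}}(g_c) < 0 \text{ and } I^{1}_{90^{\circ}}(g_c) < 0 \\ 4, & I^{1}_{0^{\circ}}(g_c) \geq 0 \text{ and } I^{1}_{90^{\circ}}(g_c) < 0 \end{cases} \tag{3}$$

The second-order LTrP is calculated by utilizing the values obtained from Eq. (3). It can be expressed as:

$$LTrP^{2}(g_c) = \left\{ f_3\left(I^{1}_{Dir}(g_c), I^{1}_{Dir}(g_1)\right), f_3\left(I^{1}_{Dir}(g_c), I^{1}_{Dir}(g_2)\right), \ldots, f_3\left(I^{1}_{Dir}(g_c), I^{1}_{Dir}(g_P)\right) \right\}\Big|_{P=8} \tag{4}$$

where p is the pixel index (ranging from 1 to 8 in this case), and the function $f_3$ is defined in Eq. (5):

$$f_3\left(I^{1}_{Dir}(g_c), I^{1}_{Dir}(g_p)\right) = \begin{cases} 0, & I^{1}_{Dir}(g_c) = I^{1}_{Dir}(g_p) \\ I^{1}_{Dir}(g_p), & \text{otherwise} \end{cases} \tag{5}$$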
Using Eqs. (4) and (5), an 8-bit tetra pattern is obtained for each central pixel. Since the direction can take four values (1, 2, 3, 4), four groups are formed, one for each direction of the central pixel, and each tetra pattern is stored in the group of its direction. These tetra patterns are then converted into binary patterns (three patterns for each direction). For example, if the direction of the central pixel $I^{1}_{Dir}(g_c)$ is 1, the second-order LTrP is separated into three binary patterns, one for each of the remaining directions. The output of the LTrP feature is extracted using Eq. (6):

$$LTrP^{2}\big|_{Direction=2,3,4} = \sum_{p=1}^{P} 2^{(p-1)} \times f_4\left(LTrP^{2}(g_c)\right)\big|_{Direction=2,3,4} \tag{6}$$

where $f_4\left(LTrP^{2}(g_c)\right)\big|_{Direction=\phi}$ equals 1 if the p-th element of $LTrP^{2}(g_c)$ equals $\phi$ and 0 otherwise, and $\phi = 2, 3, 4$. An illustration of the LTrP pattern over a face image can be visualized in Fig. 3.
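The computation referred to in Eqs. (3) to (6) can be sketched in NumPy as below. This is a minimal sketch under explicit assumptions rather than the implementation used in the experiments: the horizontal neighbor g_h is taken to be the pixel to the right, the vertical neighbor g_v the pixel above, the eight neighbors are ordered counter-clockwise starting from the right, borders are handled by edge replication, and the names `direction_map` and `ltrp_binary_patterns` are hypothetical.

```python
import numpy as np

# Neighbour offsets (row, col) for p = 1..8: start at the right-hand
# neighbour and move counter-clockwise (an assumed ordering).
NEIGHBOURS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def direction_map(image):
    """Direction (1..4) of every pixel from its first-order derivatives."""
    img = np.asarray(image, dtype=np.int32)
    padded = np.pad(img, 1, mode="edge")
    d0 = padded[1:-1, 2:] - img    # I(g_h) - I(g_c), g_h assumed to the right
    d90 = padded[:-2, 1:-1] - img  # I(g_v) - I(g_c), g_v assumed above
    dirs = np.empty(img.shape, dtype=np.uint8)
    dirs[(d0 >= 0) & (d90 >= 0)] = 1
    dirs[(d0 < 0) & (d90 >= 0)] = 2
    dirs[(d0 < 0) & (d90 < 0)] = 3
    dirs[(d0 >= 0) & (d90 < 0)] = 4
    return dirs


def ltrp_binary_patterns(image):
    """Return 12 binary pattern maps keyed by (centre direction, phi)."""
    dirs = direction_map(image)
    h, w = dirs.shape
    padded = np.pad(dirs, 1, mode="edge")
    patterns = {(dc, phi): np.zeros((h, w), dtype=np.uint8)
                for dc in (1, 2, 3, 4) for phi in (1, 2, 3, 4) if phi != dc}
    for p, (dy, dx) in enumerate(NEIGHBOURS):
        neigh = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        # Tetra code: 0 if the neighbour shares the centre direction,
        # otherwise the neighbour's own direction.
        tetra = np.where(neigh == dirs, 0, neigh)
        for (dc, phi), out in patterns.items():
            # Bit is set where the centre has direction dc and the code equals phi.
            bit = (dirs == dc) & (tetra == phi)
            out |= bit.astype(np.uint8) * np.uint8(1 << p)
    return patterns
```

Each of the four center directions yields three 8-bit pattern maps, giving 12 maps in total. How these maps are combined into the single pattern map that is passed on to the next step is not specified here, so the sketch stops at the individual maps.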

Spatial Pyramid Matching
In the previous step, we calculated the pattern map of LTrP. We now divide this pattern map into multiple partitions based on SPM and then calculate a histogram from each partition as a feature vector set. This partitioning adds spatial information to our proposed technique and is intended to make it more tolerant of rotation and pose changes, because the blocks carry information from different regions of the image.
Initially, the Pyramid Matching (PM) kernel was used to match collections of features [35]. It computes a weighted histogram intersection over multi-resolution histograms. A major drawback of this approach is that it does not consider the spatial information of the images. More precisely, the resulting features lack discriminative spatial information, which leads to inaccuracy. SPM was therefore proposed to take spatial information into account.
SPM computes the histogram distribution over several spatial resolutions, considering images of the same dimensions. The SPM kernel is computed by summing the matches of corresponding values in the different channels of a feature. For m channels, the SPM kernel can be represented as:

$$K(X, Y) = \sum_{c=1}^{m} k(X_c, Y_c)$$

where the function k represents the PM kernel and $X_c$ and $Y_c$ are the histogram distributions of the feature c over all the spatial parts.
One of the key benefits of using SPM is that the image's spatial discrimination information can be robustly obtained. This is because SPM divides the image into multi-scale regions at different resolutions such as 1 × 1, 2 × 2, and 4 × 4. These form a total of 21 blocks. We expect that partitioning the image into 21 blocks with SPM can perform better when combined with LTrP. A visualization of SPM based on a sample image is demonstrated in Fig. 4. The input image is partitioned as 1 × 1, 2 × 2, and 4 × 4 blocks. After that, features are extracted from each subblock and concatenated to form a feature vector set.

Figure 4: Illustration of SPM partitions and concatenation of calculated histogram features
In every block of each partition, histogram bins are prepared for storing the pattern values of that block. This is the process of creating a feature collection in each block. There are 21 blocks in total, so we obtain 21 feature vector sets. These features are concatenated together in another matrix for further processing.
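The partitioning and per-block histogram step can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `spm_histograms`, the use of 256 bins, and the way block boundaries are rounded for image sizes not divisible by 4 are assumptions.

```python
import numpy as np


def spm_histograms(pattern_map, grids=(1, 2, 4), bins=256):
    """Histogram every block of the 1x1, 2x2 and 4x4 partitions.

    Returns an array of shape (21, bins) for the default grids, one row
    per block. Pattern values are expected to lie in [0, bins).
    """
    pm = np.asarray(pattern_map, dtype=np.int64)
    h, w = pm.shape
    hists = []
    for g in grids:
        # Block boundaries; integer truncation handles sizes not divisible by g.
        rows = np.linspace(0, h, g + 1).astype(int)
        cols = np.linspace(0, w, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                block = pm[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                # Count how often each pattern value occurs in the block.
                hists.append(np.bincount(block.ravel(), minlength=bins)[:bins])
    return np.stack(hists)
```

Row 0 of the result is the histogram of the whole map, rows 1 to 4 belong to the 2 × 2 partition, and rows 5 to 20 to the 4 × 4 partition, giving 1 + 4 + 16 = 21 rows.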
In Section 3.3, we use max-pooling to reduce the dimensions of the feature vectors and keep only the most useful, non-redundant values in order to enhance discriminative power.

Feature Vector Formation
The concatenated histogram features from the previous section contain much of the spatial information needed for classification, but it is still in our interest to reduce that information for a better outcome. The first benefit of this reduction is that redundant information, which does not benefit the output, is removed. The second benefit is that the remaining information makes the feature more discriminative. The last benefit is that the classification process is faster because the feature vector is smaller. Suppose we have the concatenated feature vectors $f_c = \{f_1, f_2, \ldots, f_{21}\}$, one feature from each of the 21 blocks, and suppose that each block's histogram has 256 bins, one per 8-bit pattern value. Then the pool of features $f_c$ can be written as in Eq. (9):

$$f_c = \begin{bmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,J} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,J} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n,1} & f_{n,2} & \cdots & f_{n,J} \end{bmatrix} \tag{9}$$
Here n is the total number of concatenated features and J is the total number of indices of each concatenated feature. This pool of features $f_c$ can be reduced by picking only the maximum value from each column, i.e., the j-th column, which yields a one-row matrix that serves as the final feature vector f in Eq. (10). The final feature f can be obtained as:

$$f = \left[ \max_{1 \leq i \leq n} f_{i,1}, \; \max_{1 \leq i \leq n} f_{i,2}, \; \ldots, \; \max_{1 \leq i \leq n} f_{i,J} \right] \tag{10}$$

We have chosen max-pooling because it reduces the dimension effectively while keeping the most discriminative information [13]. Section 3.4 covers phase two of our technique and describes the classification method.
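The column-wise reduction of Eq. (10) is a single array operation. The sketch below assumes the pooled features are stored as an n × J array with one block per row; the function name `max_pool_features` is hypothetical.

```python
import numpy as np


def max_pool_features(block_features):
    """Reduce an (n, J) pool of block features to one J-dimensional vector
    by keeping the maximum of each column."""
    F = np.asarray(block_features)
    return F.max(axis=0)


# Toy pool with n = 3 blocks and J = 3 bins.
pool = [[1, 5, 0],
        [4, 2, 0],
        [3, 3, 7]]
f = max_pool_features(pool)
```

For the toy pool above the result is `[4, 5, 7]`. With 21 blocks of 256 bins each, the 21 × 256 pool is reduced to a single 256-dimensional vector.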

Robust Kernel Representation (RKR)
We used the RKR method for classification. It is well suited to face recognition with local features [7]. Kernel-based techniques provide a better mapping for features that cannot be separated linearly, which adds discriminative power to the features so that they can be classified better in the feature space.
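The effect of the kernel mapping can be illustrated with a simplified stand-in: an ℓ2-regularized kernel representation classifier with a Gaussian (RBF) kernel. This is not the RKR method of [7], which additionally relies on robust regression to handle occlusion; the kernel choice, the parameters `lam` and `gamma`, and the function names are assumptions made only for this sketch.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_representation_classify(X_train, labels, y, lam=0.01, gamma=1.0):
    """Represent y over all training samples in kernel space and pick the
    class with the smallest feature-space reconstruction residual."""
    X = np.asarray(X_train, dtype=float)
    labels = np.asarray(labels)
    y = np.asarray(y, dtype=float)[None, :]
    K = rbf_kernel(X, X, gamma)
    k_y = rbf_kernel(X, y, gamma)[:, 0]
    # Ridge-regularized coefficients in the kernel-induced feature space.
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), k_y)
    best_label, best_res = None, np.inf
    for c in np.unique(labels):
        m = labels == c
        a = alpha[m]
        # ||phi(y) - Phi_c a||^2 = k(y, y) - 2 a^T k_c + a^T K_cc a,
        # where k(y, y) = 1 for the RBF kernel.
        res = 1.0 - 2.0 * a @ k_y[m] + a @ K[np.ix_(m, m)] @ a
        if res < best_res:
            best_label, best_res = c, res
    return best_label
```

The residual is evaluated entirely through kernel values, so the high-dimensional mapping never has to be computed explicitly. An XOR-style layout, which no single line can separate in the input space, is handled correctly by this classifier.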

Pseudocode of Proposed Model
The algorithm of the proposed solution is summarized in two phases. The first phase is feature extraction, described in the algorithm in Tab. 1, and the second phase is classification, described in Tab. 2. Feature extraction in Tab. 1 is based on three major steps. The first step is initialization, in which the input query face image is taken for processing. The second step covers the calculations needed to obtain the features. Here, the LTrP defined in Section 3.1 is first calculated from the input image; LTrP is briefly described in Eq. (6). Then the calculated LTrP is partitioned into 3 partitions and 21 blocks for SPM calculation as defined in Section 3.2. The next step is processing, in which each block from each partition is taken and a histogram is calculated for each block. Finally, max-pooling is applied for dimension reduction as defined in Section 3.3 to obtain the final feature vector as output.

Table 2: Face classification procedure of the proposed model
Step 1: Initialization
    input feature vector calculated in Tab. 1: f = face feature
    input face dataset with identities: db = database
Step 2: Classification
    Calculate identity of input feature from db images' features according to [7]
    output id

The algorithm in Tab. 2 covers classification, where the inputs are a feature vector f that needs to be classified and a dataset of training images. Classification is done by RKR [7], also described in Section 3.4. Finally, the identity of the input image is calculated as id. This is the identity of the person recognized by the algorithm.

Datasets
We have tested our technique on four benchmark datasets: Aleix Martinez (AR) [36], Olivetti Research Ltd. (ORL) [37], Labeled Faces in the Wild (LFW) [38], and Face Recognition Technology (FERET) [39]. The AR dataset consists of 120 subjects, out of which we selected 100 subjects. For each subject, 7 clear, non-occluded images are used as training data, and two types of testing data are used: faces occluded with glasses and faces occluded with a scarf. There are 3 faces of each subject in the testing data of each category, and there are two sessions for each category. The AR dataset provides the challenging task of recognizing faces with occlusion. Examples from the AR dataset can be seen in Figs. 5 and 6.

Figure 5: Example of AR test dataset
The ORL dataset consists of 40 subjects with 10 face images per subject, making a total of 400 images. These images were taken at different times and include expression and illumination variations. We used the cross-validation technique to pick the training data and testing data, with a division of 70% training data and 30% testing data. Some samples of the ORL dataset can be seen in Fig. 7.
The LFW dataset consists of 13,233 images of 5,748 subjects. It is a challenging benchmark dataset with huge diversity, including pose variation, illumination variation, expression variation, makeup, and occlusions. The dataset is very large and contains a different number of images for each subject. For proper testing and validation, we selected the subjects that have at least 10 samples, making a total of 157 subjects. After the selection of subjects, we used the cross-validation train-test split technique to split the training and testing data: 70% training data and 30% testing data were used for the LFW dataset experiment. Some samples of the LFW dataset can be seen in Fig. 8.
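The subject selection and 70%/30% division described above can be sketched as follows. Several details are assumptions made for illustration: the split is done per subject, the sample list is a plain list of (subject, item) pairs, and the function name `filter_and_split` and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def filter_and_split(samples, min_per_subject=10, test_fraction=0.3, seed=0):
    """Keep subjects with enough samples, then split each subject's samples.

    samples : list of (subject_id, item) pairs, e.g. item = an image path
    Returns (train, test) lists of (subject_id, item) pairs.
    """
    by_subject = defaultdict(list)
    for subject, item in samples:
        by_subject[subject].append(item)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for subject, items in by_subject.items():
        if len(items) < min_per_subject:
            continue  # subject has too few samples and is dropped
        items = items[:]
        rng.shuffle(items)
        n_test = round(test_fraction * len(items))
        test += [(subject, x) for x in items[:n_test]]
        train += [(subject, x) for x in items[n_test:]]
    return train, test
```

Splitting within each subject keeps every retained identity present in both the training and the testing data, which a purely global random split would not guarantee.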

Figure 8: Example of LFW dataset
The FERET dataset comprises 14,126 images of 1,199 subjects. It was designed to evaluate the performance of state-of-the-art face recognition algorithms. The dataset has variations in pose and expression. For extensive experimentation, we took 200 subjects, with at most 7 samples per subject used as training data. For this dataset, we also used the same cross-validation technique to mix the training and testing data for appropriate results. An example of the FERET dataset can be seen in Fig. 9.

Evaluation Metrics
We have used a standard evaluation metric to measure accuracy. The metric used is the recognition rate (RR), which can be formulated as:

$$RR = \frac{I_C}{I_T} \times 100\%$$

where $I_C$ represents the number of testing images correctly identified and $I_T$ represents the total number of testing images. The recognition rate is therefore the percentage of successful identifications.
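As a small worked example of this metric, the sketch below computes the recognition rate from predicted and true identities. The function name `recognition_rate` and the toy labels are illustrative only.

```python
def recognition_rate(predicted_ids, true_ids):
    """Percentage of test images whose predicted identity is correct."""
    if len(predicted_ids) != len(true_ids) or len(true_ids) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # I_C: number of correctly identified test images.
    correct = sum(p == t for p, t in zip(predicted_ids, true_ids))
    # I_T: total number of test images.
    return 100.0 * correct / len(true_ids)


rr = recognition_rate([1, 2, 3, 4], [1, 2, 0, 4])
```

Here three of the four test images are identified correctly, so `rr` is 75.0.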

AR Dataset Results
On the AR dataset, the proposed method achieved higher accuracy than other state-of-the-art methods. For all three training-set sizes, we achieved better results than the other methods. The AR dataset results can be seen in Tab. 3.

ORL Dataset Results
On the ORL dataset, our method achieved better results than other existing techniques when three training samples were used. For 2 and 4 training samples, however, some methods showed better results than our approach. Tab. 4 shows the results of face recognition on the ORL dataset.

LFW Dataset Results
Experimentation with the LFW dataset showed better results with our method than with existing techniques. The experiments were performed with 2, 3, and 4 training samples. Tab. 5 compares the results of our method with those of other methods on the LFW dataset.

FERET Dataset Results
For the FERET dataset, extensive experimentation was performed with 3, 4, and 5 training samples. Our method outperforms the previous techniques in terms of recognition rate. Tab. 6 compares the face recognition results of our method with those of previous methods on the FERET dataset.

Discussions
On the AR dataset, the proposed model has much higher accuracy than other techniques. Using 3 training samples, we obtained 83.33% accuracy. Using 5 training samples, we obtained 88.33% accuracy, and using 7 training samples, we obtained 95.33% accuracy. These results are much better than those of other state-of-the-art methods in the literature. On the ORL dataset, the accuracy could not be improved as much. Using 2 training samples, it is 82.5%, which is lower than two other techniques, while using 3 training samples, the accuracy is 90%, which is better than all other techniques. Yet, when using 4 training samples on the ORL dataset, the accuracy is 90%, which is lower than 3 other techniques. This is because of some major pose variations in the dataset. These results can be improved by preprocessing the dataset images and aligning them correctly to reduce pose variations. The LFW dataset is challenging because variations in pose, expression, illumination, make-up, and occlusion add several complexities to the problem. Using 2 training samples, we achieved 17.77% accuracy, and for 3 training samples, we obtained 23.13% accuracy. Using 4 training samples, the accuracy is 26.25%. These results are better than those of other techniques in the literature. On the color FERET dataset, the results are much better than those of previous techniques. Using 3 training samples, the accuracy is 74.53%, whereas using 4 training samples, the accuracy is 76.36%. Using 5 training samples, the accuracy is 80.13%. These results are very promising compared with previous techniques, and the proposed method therefore outperforms other state-of-the-art methods in the literature in terms of face recognition accuracy.

Conclusion and Future Work
In this research work, an efficient feature extraction approach and a classification technique have been utilized for robust face recognition. The process is composed of two main parts: feature extraction and classification. Initially, the LTrP feature descriptor is utilized to extract local patterns. Then, SPM is utilized to partition the patterns into 21 blocks. After that, max-pooling is utilized to construct features with the most discriminative information. Finally, classification is done using an RKR method to fully exploit the discriminative power for robust performance. Due to the worldwide spread of COVID-19, people wear masks in workplaces and public places. Hence there is a need for a robust face recognition system that recognizes faces with and without masks, along with other factors such as illumination changes and pose changes. In the future, we intend to conduct extensive experiments in which the eyes as well as the nose are covered, to formulate a robust face recognition system. We also intend to use an extended version of LTrP, modified so that it collects more powerful and discriminative information from the input image as a feature vector.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.