Semantic Human Face Analysis for Multi-level Age Estimation

The human face is one of the most widely used biometrics in computer vision, from which various useful information can be derived, such as gender, ethnicity, age, and even identity. Facial age estimation has received great attention over the last few decades because of its influence on many applications, such as face recognition and verification, which may be affected by the aging changes and signs that appear on the human face as age progresses. It has thus become a prominent challenge for many researchers. One of the most influential factors in age estimation is the type of features used to train the model. Computer vision is characterized by its superior ability to extract traditional facial features such as shape, size, and texture, as well as deep features. However, it is still difficult for computers to extract and handle the semantic features inferred by human vision. We therefore need to bridge the semantic gap between machines and humans so as to exploit the human brain's capability of perceiving and processing visual information in a semantic space. Our research aims to exploit human vision for semantic facial feature extraction and to fuse these features with traditional computer-vision features, obtaining integrated and more informative features, as an initial study paving the way to further augment state-of-the-art age estimation models. A hierarchical automatic age estimation is achieved in two consecutive stages: classification to predict the (high-level) age group, followed by regression to estimate the (low-level) exact age. The results show noticeable performance improvements when fusing semantic-based features with traditional vision-based features, surpassing the performance of traditional features alone.


Introduction
Humans have a great inherent ability to perceive visual signals in their environment using their eyes, and beyond that to analyze those signals with their brains in order to make inferred decisions or take actions based on their background, knowledge, and experience [1]. Nowadays, the continuing advances in

Traditional Computer Vision-based Age Estimation
Several different computer vision methods have been investigated for facial age estimation, involving either extracting hand-crafted features or learning deep features via Convolutional Neural Network (CNN/ConvNet) architectures. The anthropometric model focuses on measurements between the main facial components. A set of landmark points is identified on the eyes, lips, nose, eyebrows, ears, chin, and forehead. Then, a number of measurements, such as axial distance, the angle between components, shortest distance, tangential distance, angle of inclination, and ratios of distances, can be calculated among face components [9]. In [10,11], the authors relied on anthropometric measurements for age group classification using Artificial Neural Networks (ANNs), while in Thukral et al. [12], a hierarchical approach was applied to human age estimation using 2D landmarks as shape features to train a Support Vector Machine (SVM) and Support Vector Regression (SVR).
The Active Shape Model (ASM) is a statistical model used to obtain descriptive information about face shape via a collection of facial landmark points [13]. In the age estimation process, other models are often combined with ASM in order to obtain a more accurate description of a human face. The method used in [14] fuses ASM, interior angle formulation, the anthropometric model, cranio-facial development, heat maps, and wrinkles together to extract aging features. Then, a CNN model is used to classify images into different age groups. Another enhanced prediction model was developed in [15] by combining both texture and shape features and applying Partial Least Squares (PLS) for dimensionality reduction.
Age determination can also be performed successfully using a polynomial regression model. The Active Appearance Model (AAM) has been used to extract features in much research related to face images and was exploited by different researchers, such as Shejul et al. [16], to extract facial features for the human age estimation task. It is worth noting that AAM is a statistical model distinguished from the ASM in that it combines two different models, one for shape and one for gray-level appearance. A wrinkle model was combined with AAM in [17] in order to enhance the results of the basic AAM, and a CNN regressor was trained for exact age prediction. In [18], AAM, Local Binary Patterns (LBP), and Gabor Wavelets (GW) were fused together for feature extraction, and a novel hierarchical approach was implemented with two consecutive regressors. A skin spots feature was added in [19] using the Local Phase Quantization (LPQ) algorithm. However, this feature had a slightly negative effect on the results due to the variation of lighting among images in the datasets used.

Figure 2: Facial growth in children and young people with age progression [7]
Figure 3: Sketch of changes occurring in the adult face with age progression [8]
AGing pattErn Subspace (AGES) was proposed by Geng et al. for automatic age estimation [20]. They defined an aging pattern as a sequence of facial images of a specific person, ordered by age. Hence, rather than using independent face images, the AGES approach treats each aging pattern as one sample. In AGES, each image occupies one specific position on the time axis t: if a person's picture is available at a certain age, it is placed in the appropriate position in the aging pattern; otherwise, its position remains empty. Each image is transformed using AAM into a feature vector, and all vectors are concatenated into one long vector with empty values for missing images, which can be reconstructed by iteratively applying Principal Component Analysis (PCA). In [21], a three-level hierarchical approach was performed using SVM and SVR with an AGES representation model, where shape, texture, and wrinkle features were fused together into one integrated model.

The CNN is one of the most commonly used deep learning models in recent years for analyzing images, videos, and other 2D and 3D data. CNNs interpret images as three-dimensional volumes; after each layer of the CNN model, the input and output volumes are represented mathematically as multi-dimensional matrices, whose dimensions are transformed as the image goes deeper into the model. CNN structures are composed of many layers, such as convolutional layers, sub-sampling layers, and fully connected layers [22].
A CNN model proposed in [23] combines handcrafted features and multi-stage learned features of the facial images. This model includes two approaches: the first is based on feature-level fusion of several local handcrafted features of wrinkles and skin with some other Biologically Inspired Features (BIFs), and the second is score-level fusion of feature vectors learned from a CNN with multiple layers. In [24], a new CNN architecture, Directed Acyclic Graph Convolutional Neural Networks (DAG-CNN), was introduced for estimating human age; it automatically learns discriminative features obtained from different layers of the GoogLeNet [25] and VGG16 [26] models and combines them together. As such, the authors of [24] built two variant architectures: the first is DAG-GoogLeNet, based on GoogLeNet, and the second is DAG-VGG16, based on VGG16. Finally, the task of estimating human age was implemented in the decision layer. In [27], transfer learning was used to solve the problem of gender and age recognition from an image using both the VGG19 and VGGFace pretrained models. A hierarchy of deep CNNs was evaluated, which first classifies persons by gender and then predicts their age using separate male and female age prediction models.

Human Perception for Face Aging
With the increasing growth of real-world applications in the last decade, there has also been growing interest in studying human visual perception in facial age estimation tasks and comparing the ability of machines versus humans in such tasks. Some researchers, such as [28], explored the ability of humans to estimate the age of subjects from their face images: they showed a set of pictures to a group of participants and asked them to guess the age of each person shown, based on facial appearance. Simultaneously, they used the same images to build a separate machine learning model. After conducting a performance evaluation and comparison, they concluded that human perception and machine learning attained close results in the age prediction task. Han et al. [29] proposed a framework for automatically estimating demographics from the human face based on Biologically Inspired Features (BIF). Furthermore, they used a crowdsourcing mechanism to estimate human perceptual performance in demographic prediction on a variety of aging datasets, including FG-NET, FERET, MORPH II, PCSO, and LFW. This enabled a comparison between computer and human abilities in predicting three demographics (age, gender, and race). They found that their proposed framework can closely match human performance in all three demographic prediction tasks and performs slightly better than humans on the PCSO and MORPH II datasets.
In [30], a study was conducted on the use of a CNN-based model for automatically estimating the age of women's faces, and further for comparing machine and human performance in terms of which face regions each focuses on when performing the same task on the same image. The reported human prediction was almost as accurate as the machine prediction using a VGG16 model (i.e., 60% accuracy for the CNN model against 61% for human prediction). Their eye-tracking results showed that when a participant focused their gaze more on the eye or mouth regions, their accuracy in estimating a person's age increased; conversely, their accuracy degraded when their gaze concentrated more on other facial skin regions. This may provide some clues to the fact that humans may indeed be able to accurately estimate the age of others based on some semantic facial features. It also indicates that human-vision capabilities are still competitive and cannot be neglected, even when compared with today's sophisticated computer-vision capabilities; thus, they can be integrated with computer-vision capabilities, or perhaps emulated, to achieve augmented performance in such a difficult age estimation task. As such, a reasonable open question is: what would be the performance of age estimation when fusing the capabilities of both humans and machines?
Although some researchers in the age estimation field have paid attention to studying human perception and interpretation of facial aging and comparing human abilities with machines, exploiting human-vision capabilities in analyzing face images and deriving detailed semantic features for age estimation purposes is yet to be investigated, which may, in turn, help bridge the semantic gap and effectively improve the latest machine-based performance. However, several previous biometrics research studies concerned with recognizing the human face [31], identifying subjects [32], and recognizing gait signatures [33] have employed human vision in analyzing and annotating images or videos to derive semantic biometric traits. Their results proved the power and effectiveness of such semantic traits in improving model-based performance.
The main contributions of this paper can be summarized as follows:
- Exploiting human-vision capabilities to extract a novel set of semantic features for a proposed semantic-based age estimation approach, fused with different combinations of traditional vision-based features.
- Analyzing the proposed semantic features using different statistical tests to investigate and select the most significant set of semantic features for the high-level (age group classification) and low-level (regression for exact age prediction) stages.
- Utilizing these semantic features to augment and improve the performance of traditional computer-vision models in predicting exact ages and age groups/levels via face image analysis, exploring and demonstrating that these two forms of facial features (i.e., traditional vision-based and semantic human-based) are differently informative and effectively integrative in human age estimation tasks.
- Providing an initial research study that may establish further promising research tracks for future work, building on our results and findings on how to supplement computer vision-based features with human-based features for augmented facial-based human age estimation, thereby filling/bridging the semantic gap between humans and machines.
The remainder of this paper is organized as follows. Section 2 describes the image dataset, the proposed novel semantic features, the computer vision-based features, the hierarchical multi-level age estimation, and the evaluation metrics used to evaluate the proposed approaches. Experiments and results are presented in Section 3. Section 4 provides conclusions and future work.

Methods and Materials
In this section, we will explain the face images dataset, computer-vision features, semantic features, hierarchical multi-level estimation approach, and the metrics used for model evaluation.

Dataset
To evaluate our proposed approach, we used a ready-made and publicly available standard dataset called the Face and Gesture Recognition Network (FG-NET) [34]. It is one of the most popular and frequently used datasets in the field of facial aging and age estimation [3,35,36]. It contains 1002 face images of 82 individuals: 48 males and 34 females. Each individual in the FG-NET dataset has between 6 and 18 color or grayscale images taken at different ages, with an average of 12 images per individual. The ages in the FG-NET dataset range between 0 and 69 years old. Each image in FG-NET is provided with 68 landmark points that represent and localize the face boundaries, eyebrows, lips, eyes, and nose.
Most FG-NET images were collected by scanning the personal photos of individuals from multiple races. Therefore, the quality of the images depends on the skill of the photographer, the camera used, the lighting when the photo was taken, and the quality of the photographic paper. This causes variation among images in quality, resolution, expression, pose, viewpoint, and illumination. In addition, occlusions appear in a number of images, such as makeup, spectacles, hats, beards, and mustaches, which may cover or blur the aging signs in some areas of skin. All these conditions pose additional challenges in building an adaptable model capable of analyzing facial aging and achieving the desired results. A sample of images of one individual in FG-NET at various ages is shown in Fig. 4.

Extracting Semantic Features
In this research, the Figure-Eight crowdsourcing platform [37] was used to collect our novel semantic features. Most of these semantic aging features were inspired by different forensic concepts of face aging [5]. We designed web-based annotation forms and used them to show each FG-NET image to several annotators on the platform, asking them to annotate the displayed image by choosing the most applicable label from a set of given descriptive labels for each facial feature, thereby describing the degree of presence of each semantic feature on the displayed subject's face. These labels are either ordinal, such as (None, Minimal, Fair, Marked, or Prominent) and (No wrinkles, Wrinkles with expressions, Wrinkles with rest, or Prominent wrinkles), or binary nominal, such as (male or female) and (absent or present). Each annotator had to choose only the one most appropriate option/label per feature, and each face image was labeled by multiple annotators. Subsequently, the labels were rescaled using z-score standardization, given as:

z = \frac{x - \mu}{\sigma}  (1)

where x is a feature vector, \mu is its mean, and \sigma is its standard deviation. Then, average labels were calculated for every feature per face image. To facilitate the annotation task, we provided a few instructions with a visual example of an annotated image and grouped related features according to each targeted facial part. The designed annotation task form is displayed in Fig. 5.
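As a minimal sketch of this standardization and per-image averaging, assuming the collected judgments sit in a hypothetical table with columns image_id, feature, and label (ordinal labels already encoded as integers), Eq. (1) followed by averaging might look like:

```python
import pandas as pd

# Hypothetical annotation table: one row per (annotator, image, feature) judgment,
# with ordinal labels encoded as integers (e.g., None=0 ... Prominent=4).
judgments = pd.DataFrame({
    "image_id": [1, 1, 1, 2, 2, 2],
    "feature":  ["forehead_wrinkles"] * 6,
    "label":    [0, 1, 0, 3, 4, 3],
})

# Eq. (1): z-score standardization of each feature's label vector.
judgments["z"] = judgments.groupby("feature")["label"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)

# Average the standardized labels over annotators, per image and feature.
semantic_features = judgments.pivot_table(
    index="image_id", columns="feature", values="z", aggfunc="mean"
)
print(semantic_features)
```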
This study exploited the human-vision capabilities of a set of people to analyze and extract the changes occurring in the texture and thickness of facial soft tissue during age progression and investigated their effects on human facial morphology. Moreover, global aging signs such as beard, mustache, gray hair, and pallor were included, in addition to other aging-related features introduced in [38][39][40]. Tab. 1 shows all 32 proposed semantic face features with their corresponding labels. The quality assurance services offered by the Figure-Eight platform were all set up to ensure data reliability: limiting tasks to the highest-accuracy and most experienced annotators, limiting the maximum number of judgments per annotator, using pre-test questions to check annotator understanding and exclude ineligible annotators, detecting and preventing extra-fast or random responses, and deploying the task to geographically unconstrained annotators so as to better reflect average human-annotator perception. Tab. 2 provides a summary of the collected semantic data.
To investigate and select the most significant semantic features in classification and regression stages, we applied two different statistical analysis tests:

ANalysis Of VAriance (ANOVA)
Analysis of variance (ANOVA) is a statistical test used to check whether the means of at least two groups differ from each other [41]. We performed ANOVA on our semantic features to measure the significance of each feature in distinguishing between different age groups in the classification stage, at a 0.01 significance level.
The p-value for every semantic feature is much less than the significance level of 0.01. Therefore, we rejected the null hypothesis for all features and accepted the alternative hypothesis, which states that the means of at least two age groups differ. Tab. 3 lists the selected semantic features in descending order of effectiveness by F-ratio, with the corresponding p-values.
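As a sketch of this test, assuming the hypothetical semantic_features table from above and an age_group label per image, a one-way ANOVA per feature could be run with SciPy:

```python
from scipy.stats import f_oneway

def anova_per_feature(semantic_features, age_group):
    """Return (feature, F, p) tuples sorted by descending F-ratio.

    semantic_features: DataFrame of per-image feature values.
    age_group: array assigning each image (row) to one of the four age groups.
    """
    results = []
    for feature in semantic_features.columns:
        groups = [semantic_features.loc[age_group == g, feature].dropna()
                  for g in sorted(set(age_group))]
        f_ratio, p_value = f_oneway(*groups)
        results.append((feature, f_ratio, p_value))
    return sorted(results, key=lambda r: r[1], reverse=True)
```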

Pearson Correlation Coefficient
Pearson's correlation coefficient is used in statistics to measure the strength and direction of the linear correlation between two continuous variables x and y [42,43]. Assuming \bar{x} and \bar{y} are the means of the first and second variables, respectively, and n is the number of (x, y) pairs, Pearson's correlation coefficient r is defined as:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}  (2)

The closer the correlation is to +1, the stronger the positive correlation between the variables, whereas a value closer to -1 indicates a stronger but negative correlation. Correspondingly, the correlation becomes weaker either way as it approaches zero.
As shown in Fig. 6, the correlation between age and each of our semantic features is strongly positive, except for three features: gender and the presence of a mustache or beard. This is because very few subjects appear in FG-NET images with a mustache or beard: only 36 images show subjects with a mustache and 31 with a beard. Furthermore, there is no direct or sensible relationship between human gender and age, given the natural stability of gender over the lifetime. Nevertheless, extended analysis of different aging characteristics with respect to gender could enable more understanding of the similarities and differences between males and females (or the average male and average female) in aging progression changes or signs, such as what, when, where, and how similar or different those aging characteristics are expected to be, which could, in turn, help age estimation. Thus, these three semantic features were excluded from the regression stage.
Tab. 4 lists the semantic features in descending order of their correlation strength with age, together with the corresponding p-value for each feature at the 0.01 significance level.
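A minimal sketch of this analysis, again assuming the hypothetical semantic_features table and an age vector aligned to its rows:

```python
from scipy.stats import pearsonr

def correlation_with_age(semantic_features, age):
    """Return (feature, r, p) tuples sorted by descending correlation with age (Eq. 2)."""
    results = []
    for feature in semantic_features.columns:
        r, p_value = pearsonr(semantic_features[feature], age)
        results.append((feature, r, p_value))
    return sorted(results, key=lambda item: item[1], reverse=True)
```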

Extracting Computer-vision Features
To prepare the images for the computer vision-based feature extraction stage, all colored images were converted into grayscale. Next, since FG-NET contains images from the subjects' personal photo collections, pose variation is observed in many images. Thus, images were rotated to a standard face pose. Image rotation was conducted based on the coordinates of the centers of the two eyes to align all images vertically, where the rotation angle \theta is computed as:

\theta = \arctan\left(\frac{y_l - y_r}{x_l - x_r}\right)

where (x_r, y_r) and (x_l, y_l) are the positions of the right and left eyes, respectively. Fig. 7 provides an example of a face image before and after preprocessing.
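A minimal sketch of this alignment step, assuming the eye centers are taken from the FG-NET landmarks; OpenCV is used here purely for illustration, and the file name and coordinates are placeholders:

```python
import cv2
import numpy as np

def align_face(image, right_eye, left_eye):
    """Rotate the image so the line through the eye centers becomes horizontal."""
    (x_r, y_r), (x_l, y_l) = right_eye, left_eye
    angle = np.degrees(np.arctan2(y_l - y_r, x_l - x_r))        # rotation angle theta
    center = ((x_r + x_l) / 2.0, (y_r + y_l) / 2.0)             # rotate about mid-eye point
    rotation = cv2.getRotationMatrix2D(center, angle, scale=1.0)
    return cv2.warpAffine(image, rotation, (image.shape[1], image.shape[0]))

gray = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2GRAY)  # placeholder image
aligned = align_face(gray, right_eye=(120, 140), left_eye=(180, 150))
```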

Active Appearance Model
The Active Appearance Model (AAM) is a computer-vision model proposed by Cootes et al. [44]. It combines shape and gray-level appearance statistical models to represent and interpret objects in images. It has been developed and used in many applications, such as object tracking, gait analysis, human eye modeling, facial expression recognition, and medical image segmentation [45]. We employed the AAM of [46], proposed for fast AAM fitting in-the-wild.
In this work, to build the shape model, a set of 68 landmark points (X_1, Y_1, X_2, Y_2, ..., X_68, Y_68) describing the shape of a human face is required across D training images. Procrustes analysis was applied to remove similarity transformations from the original shapes and obtain D similarity-free shapes. Then, PCA was performed on these shapes to obtain the shape model \hat{s}, defined by a set of shape eigenvectors S and the mean shape s_0. The statistical shape model can be expressed as:

\hat{s} = s_0 + S p

where p contains the shape parameters of the deformable model, so that any new similarity-free shape s can be approximated by \hat{s}. Building the appearance model requires removing shape variation from the texture. This was achieved by warping each texture I to the reference frame obtained using the mean shape. After that, PCA was performed on the shape-free textures to obtain the appearance model \hat{I}, defined by a set of appearance eigenvectors A and the mean normalized appearance vector A_0. Assuming c represents the appearance parameters, the statistical appearance model can be defined as:

\hat{I} = A_0 + A c

The Fast-Forward algorithm was used to fit the AAM in order to extract the global facial shape and appearance from each image. Fig. 8 illustrates the detected landmark points with the corresponding appearance.
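A minimal sketch of building the PCA shape model from already Procrustes-aligned landmark arrays; scikit-learn's PCA stands in for the eigen-analysis, and the random array is a placeholder for real training shapes:

```python
import numpy as np
from sklearn.decomposition import PCA

# aligned_shapes: (D, 136) array of similarity-free shapes,
# each row being 68 (x, y) landmark pairs flattened.
aligned_shapes = np.random.rand(500, 136)  # placeholder for real training shapes

pca = PCA(n_components=0.98)  # keep components explaining 98% of variance (illustrative)
pca.fit(aligned_shapes)

s0 = pca.mean_                # mean shape s_0
S = pca.components_           # shape eigenvectors, one per row

# A new shape s is encoded as parameters p and reconstructed as s_hat = s_0 + S^T p.
s = aligned_shapes[0]
p = S @ (s - s0)
s_hat = s0 + S.T @ p
```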

Geometric Ratios
Craniofacial changes occur in the human face with age progression and convey much information about facial aging. They are useful in distinguishing children from adults and in studying facial aging signs related to children, who are characterized by rapid geometric growth [47].
Because facial shape may be affected by expression and pose, which can influence the localization of facial landmark points, ratios between distances were used to measure human craniofacial growth rather than distances alone. We calculated 19 distances D via 23 landmark points and used them to form 18 different ratios R as geometric features, as in [47,48]. All landmark points in this section were produced using AAM as demonstrated in the previous section, except for point number 23, which was localized manually because, for some individuals in the FG-NET dataset, hair covers the forehead or baldness appears. The ratio R between two distances is calculated as:

R = \frac{\|p_1 - p_2\|}{\|q_1 - q_2\|}

where p_1, p_2, q_1, and q_2 are the coordinates of the p and q landmark points delimiting the two distances. Fig. 9 presents the 23 landmark points used to calculate the geometric ratio features, and Fig. 10 illustrates the 18 computed shape ratios of distances between different landmark points. The difference between an adult and a child in the ratio R_6, between the face width and the height from the eyebrow-center point to the chin point, is shown in Fig. 11.
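A minimal sketch of one such ratio; the landmark indices below are illustrative, not the paper's exact numbering:

```python
import numpy as np

def distance_ratio(landmarks, p1, p2, q1, q2):
    """Ratio between the Euclidean distances |p1-p2| and |q1-q2|.

    landmarks: (N, 2) array of (x, y) points; p1, p2, q1, q2: point indices.
    """
    d_p = np.linalg.norm(landmarks[p1] - landmarks[p2])
    d_q = np.linalg.norm(landmarks[q1] - landmarks[q2])
    return d_p / d_q

# e.g., a face-width to face-height style ratio (indices are hypothetical):
landmarks = np.random.rand(23, 2)  # placeholder for AAM-produced points
r6 = distance_ratio(landmarks, p1=0, p2=14, q1=8, q2=22)
```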

Local Binary Patterns
Recently, in human facial age estimation, local skin feature analysis has achieved substantial efficiency [49], as it can represent the significant facial information related to soft-tissue aging while removing noise such as hair, background, and non-skin areas. As shown in Fig. 12, eleven facial regions were cropped and used to derive local texture features using Local Binary Patterns (LBP). LBP's success stems from its robust binary code, which is highly sensitive to texture and soft-tissue changes while remaining uninfluenced by light, noise, and facial expressions [18]. LBP takes the value of each pixel as a threshold and computes an 8-bit binary code from its neighboring pixels. Then, a histogram is generated as a texture descriptor [50,51]. The LBP code is expressed as:

LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\,2^p, \quad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}

where g_c is the gray value of the center pixel, g_p is the gray value of the equally spaced pixels on a circle of radius R, s is the thresholding function, P is the number of neighboring pixels, and R is the distance between the center and the neighboring pixels.
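A minimal sketch of extracting an LBP histogram from one cropped skin region; scikit-image is used for illustration, and the random crop is a placeholder:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(region, P=8, R=1):
    """Compute a normalized LBP histogram for a grayscale skin region."""
    codes = local_binary_pattern(region, P, R, method="default")
    hist, _ = np.histogram(codes.ravel(), bins=2**P, range=(0, 2**P))
    return hist / hist.sum()

region = (np.random.rand(32, 32) * 255).astype(np.uint8)  # placeholder crop
features = lbp_histogram(region)
```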

Gabor Filters
Gabor filters have been commonly used in facial age estimation to extract wrinkle and edge features because of their ability to determine both the orientation and magnitude of wrinkles and lines [51]. They are also characterized by their robustness to noise caused by aspects such as glasses and beards [52]. Here, a bank of 2D Gabor filters comprising five frequencies and eight orientations was applied to each pixel of the face image, as illustrated in Fig. 13. The Gabor filter formula [53] is defined as:

g(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi f x' + \phi\right), \quad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta

where \theta denotes the orientation, f represents the frequency, \sigma is the standard deviation of the Gaussian envelope, \phi is the phase offset, and \gamma is the spatial aspect ratio that specifies the ellipticity of the support of the Gabor function. The local skin regions shown in Fig. 12 were also used here to extract Gabor features. Finally, all computer-vision features were rescaled using z-score standardization, Eq. (1); then, due to the high dimensionality of the shape, appearance, LBP, and Gabor features, PCA was performed as a feature subset selection method to determine and select the most effective components. Tab. 5 summarizes the number of most effective principal components retained for each feature type.
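A minimal sketch of such a five-frequency, eight-orientation filter bank; OpenCV's getGaborKernel is parameterized by wavelength (the reciprocal of frequency), and the specific frequencies, kernel size, and sigma below are illustrative, not the paper's values:

```python
import cv2
import numpy as np

def gabor_bank(frequencies=(0.05, 0.1, 0.2, 0.3, 0.4), n_orientations=8):
    """Build a bank of real-valued 2D Gabor kernels (5 frequencies x 8 orientations)."""
    kernels = []
    for f in frequencies:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            kernel = cv2.getGaborKernel(
                ksize=(31, 31), sigma=4.0, theta=theta,
                lambd=1.0 / f,          # wavelength = 1 / frequency
                gamma=0.5, psi=0.0,
            )
            kernels.append(kernel)
    return kernels

region = (np.random.rand(64, 64) * 255).astype(np.float32)   # placeholder skin crop
responses = [cv2.filter2D(region, cv2.CV_32F, k) for k in gabor_bank()]
features = np.array([np.abs(r).mean() for r in responses])   # mean absolute response per filter
```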

Hierarchical Classification and Regression
Estimating human facial age may be treated as a multi-class classification problem [54][55][56], a regression problem [57,58], or a hybrid of both [18,59]. Many previous research results have shown that hybrid or hierarchical age estimation approaches outperform single-stage approaches [60].
In this work, we applied a multi-level hierarchical age estimation approach, which classifies images into their appropriate age groups (high-level age estimation) and then predicts an exact age value using regression (low-level age estimation). A single (one-vs-one) SVM multi-classifier was trained to classify images into one of four age groups: from newborn to toddler (0-3 years), from pre-adolescence to adolescence (4-16 years), from young adult to adult (17-39 years), and from middle-aged to senior (40-69 years).
In the regression stage, four SVM regression models were trained. Each regressor was trained on a specific age group and is responsible for predicting ages within that group's range. To reduce the effect of misclassification, the regressors were trained with overlapping ranges between adjacent groups; the amount of overlap among age groups was selected experimentally for better regression performance. Fig. 14 shows the structure of our multi-level age estimation.
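A minimal sketch of this hierarchy with scikit-learn; the group boundaries follow the paper, while the overlap margin below is illustrative, since the paper selected it experimentally:

```python
import numpy as np
from sklearn.svm import SVC, SVR

GROUPS = [(0, 3), (4, 16), (17, 39), (40, 69)]
OVERLAP = 3  # years of overlap with adjacent groups (illustrative value)

def train_hierarchy(X, ages):
    """Train a one-vs-one SVM age-group classifier plus one SVR per group."""
    group_of = np.array([next(i for i, (lo, hi) in enumerate(GROUPS)
                              if lo <= a <= hi) for a in ages])
    classifier = SVC(decision_function_shape="ovo").fit(X, group_of)
    regressors = []
    for lo, hi in GROUPS:
        mask = (ages >= lo - OVERLAP) & (ages <= hi + OVERLAP)  # overlapping training range
        regressors.append(SVR().fit(X[mask], ages[mask]))
    return classifier, regressors

def predict_age(classifier, regressors, x):
    """High-level group prediction, then low-level exact-age regression."""
    group = classifier.predict(x.reshape(1, -1))[0]
    return regressors[group].predict(x.reshape(1, -1))[0]
```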

Evaluation Metrics
The accuracy measure is used to evaluate the performance of the SVM in age group classification, defined as:

Accuracy = \frac{C}{N} \times 100\%

where C is the number of correctly classified images and N is the total number of tested images.
The Mean Absolute Error (MAE) and Cumulative Score (CS) evaluation metrics are commonly used to evaluate the performance of exact age estimation [61]. MAE is defined as the mean absolute difference between ground-truth and predicted ages:

MAE = \frac{1}{N}\sum_{i=1}^{N} \left| \hat{age}_i - age_i \right|

where N is the number of tested images, \hat{age}_i is the predicted age of image i, and age_i is the ground-truth age of image i.
The CS is the ratio of the number of images whose errors are less than or equal to a threshold value to the total number of test images. CS enables performance comparison at different absolute error levels and can be defined as:

CS(x) = \frac{N_{e \le x}}{N} \times 100\%

where N_{e \le x} is the number of images whose estimation error e is less than or equal to the defined threshold x, and N is the total number of test images.

Figure 14: Multi-level hierarchical classification and regression with age-group overlapping to reduce the effect of misclassification on age prediction
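A minimal sketch of these three metrics:

```python
import numpy as np

def accuracy(pred_groups, true_groups):
    """Percentage of correctly classified images."""
    return 100.0 * np.mean(np.asarray(pred_groups) == np.asarray(true_groups))

def mae(pred_ages, true_ages):
    """Mean absolute error between predicted and ground-truth ages."""
    return np.mean(np.abs(np.asarray(pred_ages) - np.asarray(true_ages)))

def cumulative_score(pred_ages, true_ages, threshold):
    """Percentage of images with absolute error <= threshold years."""
    errors = np.abs(np.asarray(pred_ages) - np.asarray(true_ages))
    return 100.0 * np.mean(errors <= threshold)
```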

Experiments and Results
Each person in the FG-NET dataset has several images at different ages. Therefore, dividing the dataset into 80% for training and 20% for testing, as in [62,63], is not an ideal way to test our models, because very similar images belonging to one person could appear in both the training and testing sets. Hence, to prevent this, we followed the Leave One Person Out (LOPO) validation technique, as applied in [64][65][66][67][68], to evaluate our proposed approach unbiasedly and effectively. In LOPO, all images belonging to one person are placed in the testing set, while the images of all other people in the dataset are used to train the model. The LOPO procedure is repeated once per individual in the database, with a different individual used for testing each time, and the final results are computed from all the estimates.
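LOPO corresponds to scikit-learn's leave-one-group-out splitter, with the person identity as the group; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def lopo_predictions(model, X, ages, person_ids):
    """Collect out-of-person predictions: each fold tests one person's images."""
    predictions = np.empty_like(ages, dtype=float)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, ages, groups=person_ids):
        model.fit(X[train_idx], ages[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    return predictions  # evaluate MAE / CS over all folds at once
```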
The accuracy metric is used here to evaluate the performance of the SVM in classifying images into their respective age groups. As shown in Tab. 6, we carried out several classification experiments with various combinations of features. The accuracy results are presented first using computer vision-based features alone, and then after augmenting them with our novel semantic features. All 32 proposed semantic features were included in the classification stage, based on the significance analysis deduced by ANOVA. A significant improvement was observed in all classification experiments after supplementing the vision-based classifiers with the semantic features. The best accuracy obtained was 75.15% when adding semantic features, against only 63.87% when using vision features alone. We rely on these experiments later to perform the regression process. Fig. 15a shows the confusion matrix when computer-vision features are used in isolation: age group classification suffers a high misclassification rate, especially in the fourth age group, covering people of middle-aged to senior ages. On the other hand, Fig. 15b provides the confusion matrix for the same age group classification approach after augmentation with the proposed semantic features. As can be observed, classification accuracy increased significantly in all age groups. Furthermore, the majority of misclassified images were assigned to age groups adjacent to their actual one, indicating a closer age group estimate and thus a smaller error range, unlike the traditional vision-based methods, which often assigned a distant, incorrect age group in such misclassification cases. This demonstrates the potency of these semantic features and their capability to distinguish between age groups with the smallest possible error.
The MAE measurement evaluates the performance of the regressors in estimating exact ages. In the regression stage, we eliminated the gender, mustache, and beard semantic features from our experiments due to their weak effect on and relationship with the exact age value, as shown by the Pearson correlation analysis. Tab. 8 summarizes the MAE produced by each age group separately to clarify the positive impact of the semantic features on each regressor. The results show how the semantic features decrease the MAE for all regressors, with a particularly clear improvement for older ages in the conducted experiments. This reflects the viability and great sensitivity of our proposed semantic features in capturing and analyzing details of soft facial tissue, which are essential signs of aging in adults and seniors.
The distribution of images in the FG-NET dataset is not balanced across ages: the number of images decreases as age increases. Since regression models usually need sufficient training samples to predict ages efficiently, the MAE worsens as the average age within the age group increases.
The cumulative score was computed at different error levels ranging from 0 to 10 years. When using computer-vision features, 83.53% of images have an error of at most 10 years, while the cumulative score improved to 90.12% after augmenting them with semantic features.
After a considerable improvement had been demonstrated for the proposed semantic features by all measures in the experiments, we used them alone to train a classifier and regressors on the age estimation task, in order to validate their viability and performance in isolation. The resulting performance was 61.88% classification accuracy, 6.15 MAE, and 83.23% CS for age prediction. Fig. 16 compares the CS performance when using computer-vision features, semantic features, and the fusion of both. Finally, starting from the features of the second experiment in Tab. 7, which achieved the best MAE (4.41 years), a Bayesian optimization was performed for each regressor independently, because each deals with a different age range. The Bayesian optimization searches for the optimum hyperparameters that give the minimum error. We obtained an MAE of 3.90 and a CS of 92.42% at the 10-year error level. Fig. 17 and Tab. 9 compare the CS and MAE, respectively, before and after the optimization process. Fig. 18 provides samples of good and poor age estimates for some images, along with their respective actual and estimated age values.

Figure 17: Cumulative Score of our proposed augmented approach before and after the optimization process

Conclusions and Future Work
In this paper, we investigated the problem of human age estimation from face images. We proposed a set of novel semantic features representing various facial aging characteristics and signs. We examined the significance and effectiveness of each semantic feature using diverse statistical analysis tests, and then exploited the most powerful features to augment traditional computer vision-based features. A hierarchical estimation approach was applied by performing classification to predict the age group, followed by regression to estimate the exact age. The FG-NET aging dataset, with its variety in expressions, races, resolutions, poses, and illumination, was used to evaluate the performance of our proposed approach.
Experimental results showed a significant and remarkable improvement in age estimation performance, by all measures, at both stages, age group (high-level) classification and exact age (low-level) prediction, when fusing and supplementing traditional computer vision-based features with our proposed human-based semantic features, especially for ages in the fourth age group, ranging from middle-aged to senior (i.e., from 40 to 69 years old).
This initial research study may establish further research tracks for future work, building on our results and findings on how to supplement computer vision-based features with human-based features for augmented facial-based human age estimation.
As future work, increased discrimination for more robust age estimation could be investigated by fusing our proposed semantic features with deep features. Based on this research's results and conclusions, we expect that our semantic features would most likely be capable of augmenting and enhancing the performance of the most recent deep learning and CNN-based models when fused with them, outperforming those models used in isolation. Learning automatic semantic face feature extraction and annotation is another desirable direction for future work, which would help researchers improve computer-vision techniques and bridge the semantic gap between machines and humans automatically.