Improved Face Recognition Across Poses Using Fusion of Probabilistic Latent Variable Models

Face recognition systems deployed in uncontrolled environments often have to identify faces that appear in poses different from those of the enrolled samples. To address this problem, probabilistic latent variable models have been used to perform face recognition across poses. Although these models have demonstrated outstanding performance, it is not clear whether richer parameterizations always lead to performance improvement. This work investigates this issue by comparing the performance of three probabilistic latent variable models, namely PLDA, TFA, and TPLDA, as well as the fusion of these classifiers, on collections of video data. Experiments on the VidTIMIT+UMIST and the FERET datasets show that fusion of multiple classifiers improves face recognition across poses, provided that the individual classifiers have similar performance. This suggests that different probabilistic latent variable models learn statistical properties of the data that are complementary (not redundant). Furthermore, fusion across multiple images is shown to produce better performance than recognition using a single still image.


Introduction
Face recognition technology has played an important role in various automatic tasks, e.g., access control [1,2], security and surveillance [3,4], human computer interaction [5], and multimedia annotation [6]. Faces are the central objects in these tasks, on the basis of which human identities are confirmed. Compared to other biometrics, such as fingerprints or irises, faces provide a more natural, direct, friendly, convenient, and non-intrusive means of human identification. Faces therefore enjoy a high level of acceptance and offer wide potential for application.
While automatic face recognition has been carried out successfully in controlled experiments, its practical use is still limited. Situations in real-world environments change unpredictably and might significantly degrade recognition performance. Pose variation is a major factor that critically affects face recognition. This variation induces non-convex facial shapes, self-occlusion, and nonlinear changes of shapes and appearances that complicate classification. Furthermore, probe faces in real-world environments often appear in poses that are totally different from those in the enrollment databases. Identification in this case has to be performed by matching face images across poses.
Two approaches have been proposed to address face recognition across poses: (1) recognition through resynthesis, and (2) matching in pose-invariant spaces. The first approach reconstructs probe faces and enrollment samples in a reference pose (target pose) and applies traditional classification methods afterward. Geometric shape models, such as ASMs, AAMs, and 3DMMs, have been employed to facilitate accurate reconstruction [7][8][9]. Statistical methods, particularly linear and nonlinear regressions, have also been used to reconstruct image patches [10][11][12], image features, e.g., Gabor jets [13], or other representations, e.g., mixture distributions [14], in the desired view. The second approach transforms faces of different poses into pose-invariant representations and infers face identities by matching these representations in the pose-invariant spaces. The 3DMMs are fitted to face images in [15] to produce 3D shape and texture parameters that serve as pose-invariant representations. Statistical methods, such as subspace alignment [16] and kernel discriminant analysis [17], have also been employed for the same purpose. Recognition across poses is performed in [18] through the use of light-fields, i.e., the concatenation of face images of individuals from a number of poses. Probe faces and enrollment samples are viewed as light-fields with missing values whose least-squares projections onto the eigenspace serve as pose-invariant representations. More recently, probabilistic latent variable models [19][20][21] have been applied to face recognition across poses with superior performance. These methods assume that there exists a multidimensional latent variable that uniquely represents the identity of an individual's faces irrespective of their poses. Using these models, the likelihood that face images with different poses correspond to the same identity (latent variable) can be estimated.
While probabilistic latent variable models have been successfully used in face recognition across poses, it is unclear whether richer parameterizations always lead to performance improvement. It is not yet known, for example, whether the use of pose-specific transformations (tied models) makes the generic transformations (non-tied models) completely redundant. Similarly, it remains to be confirmed whether explicit modelling of within-class variations (discriminant analysis) is always a better and complete substitute for non-explicit modelling (factor analysis). This work investigates this issue by comparing the performance of variants of probabilistic latent variable models as well as the fusion of these classifiers. More specifically, three classification models are evaluated: probabilistic linear discriminant analysis (PLDA) [19], tied factor analysis (TFA) [20], and tied probabilistic linear discriminant analysis (TPLDA) [21]. Unlike the existing work, the evaluation is performed not only on still images but also on videos. Videos differ from still images in the much larger number of images available for each individual and in the dense sampling of faces within the pose space. It is therefore interesting to see how such rich data benefit the recognition.
Figure 1 shows the proposed framework for face recognition across poses. The framework is composed of two components: front-end and classifier. The front-end serves to localize faces in videos, localize facial landmarks and estimate head poses, extract features, and group the features based on the estimated poses. The classifier matches probe faces, assumed to be non-frontal, against enrollment samples, which are frontal. Matching scores are computed based on probabilistic latent variable models that are constructed in the training stage. The rest of this section describes in more detail each of the processes involved in the proposed framework.

Face and Facial Landmark Localization and Head Pose Estimation
Given a video frame as input, facial ROIs (regions of interest, i.e., bounding boxes) are localized using a combination of three Viola-Jones face detectors [22]. These detectors have been trained for three discrete poses: frontal, left-profile, and right-profile. The frontal detector is applied first and is followed by the application of the left-profile detector. If the frontal detector successfully detects a face, the left-profile detector is applied only to a small area around this detection result. The right-profile detector is executed only when the left-profile detector does not give a positive result. This procedure is able to anticipate left-right head rotations and might return multiple detection results for one particular face.
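The detector ordering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three detectors are passed in as hypothetical callables that take a frame and an optional search area, the `margin` used to define the "small area" around a frontal detection is an assumed value, and letting the right-profile detector search the whole frame is also an assumption, since the text does not specify it.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)
# Hypothetical detector interface: (frame, optional search area) -> list of boxes.
Detector = Callable[[object, Optional[Box]], List[Box]]


def expand(box: Box, margin: float) -> Box:
    """Grow a box by `margin` (a fraction of its size) on every side.

    Clipping to the frame borders is left out for brevity.
    """
    x, y, w, h = box
    dx, dy = int(w * margin), int(h * margin)
    return (x - dx, y - dy, w + 2 * dx, h + 2 * dy)


def detect_face_rois(frame, frontal: Detector, left: Detector, right: Detector,
                     margin: float = 0.25) -> List[Box]:
    """Apply the frontal, left-profile, and right-profile detectors in order."""
    rois = list(frontal(frame, None))
    # If a frontal face was found, the left-profile detector only searches
    # a small area around that detection; otherwise it searches the frame.
    search_area = expand(rois[0], margin) if rois else None
    left_hits = list(left(frame, search_area))
    rois.extend(left_hits)
    # The right-profile detector runs only when the left-profile one fails.
    # Assumption: it searches the whole frame.
    if not left_hits:
        rois.extend(right(frame, None))
    return rois
```

Because every positive detection is kept, the function may return several ROIs for one face, which matches the behaviour noted in the text.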
After facial ROIs have been detected, the process continues with the search of facial landmarks. We train a cascaded regression model that is able to perform simultaneous facial landmark localization and head pose estimation [23,24]. Cascaded regression has been well known as an accurate and reliable method for facial landmark localization. In this work, the model has also been trained to handle occluded facial landmarks (for faces that rotate away from frontal). This model makes use of multiple facial ROIs as input to produce a single final output.

Face Normalization and Feature Extraction
Based on the estimated head poses, faces are classified into frontal, half-profile, and profile, which are defined as 0-20°, 20°-50°, and 50°-90° of left-right rotation, respectively. Note that faces facing to the left direction are flipped horizontally. Before appearance features are extracted, the faces are normalized and segmented.
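As a small illustration of the grouping rule above, the sketch below maps a yaw angle to one of the three pose groups and mirrors left-facing faces. The function names, the sign convention (negative yaw means facing left), and the handling of the boundary angles (20° and 50° are assigned to the higher group) are assumptions; the text only gives the ranges.

```python
from typing import List


def pose_group(yaw_degrees: float) -> str:
    """Map a left-right rotation angle to frontal / half-profile / profile."""
    angle = abs(yaw_degrees)
    if angle > 90:
        raise ValueError("yaw outside the modelled range of 0-90 degrees")
    if angle < 20:
        return "frontal"
    if angle < 50:
        return "half-profile"
    return "profile"


def flip_if_left_facing(image: List[List[int]], yaw_degrees: float) -> List[List[int]]:
    """Mirror left-facing faces so that all rotated faces point the same way.

    Assumption: negative yaw denotes a face turned to the left.
    """
    if yaw_degrees < 0:
        return [row[::-1] for row in image]
    return [row[:] for row in image]
```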
Piece-wise triangular warp is employed to normalize face images. This technique has been observed in this research to work better than the traditional procedures, i.e., similarity transforms. Piece-wise triangular warp employs point distribution models (PDMs) [25] to perform normalization. A PDM represents 2D facial meshes using a set of orthogonal basis shape vectors.
Three PDMs are constructed for frontal, half-profile, and profile faces, respectively. Given a number of facial landmarks returned by the cascaded regression model, least-squares projection is performed to obtain the complete parameters of the PDMs as well as the corresponding 2D mesh. Figure 2 shows the estimated 2D meshes of different faces as well as the results of piece-wise triangular warp for the normalization. Note that a single reference mesh is used to deform (warp) all faces of a particular pose. Compared to similarity transforms, piece-wise triangular warp produces better correspondences of facial parts at the cost of losing facial shape information. The warped faces are resized into 51×51 ROIs, whose intensity values are concatenated to form feature vectors of 2601 elements.
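The least-squares projection mentioned above can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used in the paper: the mesh is treated as a flat vector of coordinates, the PDM is taken to be a mean shape plus a linear combination of basis shape vectors, only the coordinates listed in `visible` are used in the fit, and the names `fit_pdm` and `solve_linear` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def solve_linear(a: List[Vector], b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_pdm(mean: Vector, basis: List[Vector], observed: Vector,
            visible: List[int]) -> Tuple[Vector, Vector]:
    """Fit PDM parameters to the visible coordinates and rebuild the full mesh.

    Solves the normal equations (B_v^T B_v) p = B_v^T (s_v - mean_v), where the
    subscript v restricts rows to the visible coordinates. Too few visible
    coordinates make the system singular (division by zero here).
    """
    k = len(basis)
    gram = [[sum(basis[i][d] * basis[j][d] for d in visible) for j in range(k)]
            for i in range(k)]
    rhs = [sum(basis[i][d] * (observed[d] - mean[d]) for d in visible)
           for i in range(k)]
    params = solve_linear(gram, rhs)
    mesh = [mean[d] + sum(params[i] * basis[i][d] for i in range(k))
            for d in range(len(mean))]
    return params, mesh
```

Coordinates that are not listed in `visible` are ignored during the fit and filled in from the model when the mesh is rebuilt, which is how a complete mesh can be obtained from a partial set of landmarks.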

Classification
As mentioned earlier, probabilistic latent variable models are employed in this work to match face images across poses. These include PLDA [19], TFA [20], and TPLDA [21], first proposed by Prince and colleagues. The tied models generalize the "original" models by introducing pose-specific generative transformations over the single latent identity space. More explicitly, PLDA can be described as
$$x_{ij} = \mu + F h_i + G w_{ij} + \epsilon_{ij},$$
while TFA and TPLDA can be expressed as
$$x_{ijk} = \mu_k + F_k h_i + \epsilon_{ijk}$$
and
$$x_{ijk} = \mu_k + F_k h_i + G_k w_{ijk} + \epsilon_{ijk},$$
respectively. The term $x_{ijk}$ represents the $j$-th observation of class $i$ in pose $k$, $h_i$ is the latent identity variable shared by all observations of class $i$, and $w_{ijk}$ is the latent variable that accounts for within-class variation. For each pose $k$, four parameters are defined: the mean $\mu_k$, the bases $F_k$ and $G_k$, and the diagonal covariance matrix $\Sigma_k$ of $\epsilon_{ijk}$.
TFA and TPLDA models (analogous to PLDA) can be trained using an EM algorithm that alternates two computation steps until convergence. In the expectation step, the expected values of the latent variables $h_i$ and $w_{ijk}$ are calculated for each individual $i$ using the data of that individual from all poses, $x_{ij\bullet}$. In the maximization step, the model parameters $F_k$, $G_k$, and $\Sigma_k$ are optimized for each pose $k$ using the data of that pose from all individuals, $x_{\bullet\bullet k}$. Interested readers are encouraged to refer to the comprehensive discussion of this algorithm in [19][20][21][26].
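To make the generative view concrete, the sketch below draws observations from a tied model of the form x_ijk = µ_k + F_k h_i + G_k w_ijk + ε_ijk, which is the usual way TPLDA is written in the cited work of Prince and colleagues; TFA corresponds to dropping the G_k term. The parameter values, dimensions, and function names are illustrative assumptions, and standard normal priors on the latent variables are assumed.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per observed dimension


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sample_tied_plda(mu_k: Vector, f_k: Matrix, g_k: Matrix,
                     sigma_k_diag: Vector, h_i: Vector,
                     rng: random.Random) -> Vector:
    """Draw one observation of identity h_i in pose k.

    w_ijk is drawn from a standard normal (assumed prior); eps has the
    diagonal covariance given by sigma_k_diag (variances).
    """
    w_ijk = [rng.gauss(0.0, 1.0) for _ in range(len(g_k[0]))]
    eps = [rng.gauss(0.0, var ** 0.5) for var in sigma_k_diag]
    identity_part = matvec(f_k, h_i)
    within_part = matvec(g_k, w_ijk)
    return [m + a + b + e
            for m, a, b, e in zip(mu_k, identity_part, within_part, eps)]


rng = random.Random(0)
h = [rng.gauss(0.0, 1.0) for _ in range(2)]  # one identity, shared across poses

# Toy pose-specific parameters (made up for illustration only).
frontal = dict(mu_k=[0.0, 0.0, 0.0], f_k=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
               g_k=[[0.1], [0.1], [0.1]], sigma_k_diag=[0.01, 0.01, 0.01])
profile = dict(mu_k=[1.0, 1.0, 1.0], f_k=[[0.2, 0.8], [0.9, 0.1], [0.3, 0.3]],
               g_k=[[0.2], [0.0], [0.1]], sigma_k_diag=[0.02, 0.02, 0.02])

x_frontal = sample_tied_plda(h_i=h, rng=rng, **frontal)
x_profile = sample_tied_plda(h_i=h, rng=rng, **profile)
```

The two samples look different because each pose has its own mean, bases, and noise, yet both are generated from the same identity vector `h`; recognition across poses amounts to asking whether two observations could share such a vector.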
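The trained models are used to recognize probe faces during the recognition phase. Prince and Elder [19] propose a Bayesian model comparison approach that assumes data points of the same class are generated from the same value of the latent identity variable (LIV). Given a probe $x_p$ and samples of $M$ classes $x_1, x_2, \ldots, x_M$, there are $M$ generation models $\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_M$ to consider. $\mathcal{M}_m$ represents the case where $x_p$ and $x_m$ are bound to the same LIV, which is $h_m$, while the other samples are bound to their own LIVs. The likelihood $P(x_p, x_{1 \ldots M} \mid \mathcal{M}_m, \theta)$ can then be defined as
$$P(x_p, x_{1 \ldots M} \mid \mathcal{M}_m, \theta) = P(x_p, x_m \mid \theta) \prod_{i \neq m} P(x_i \mid \theta),$$
where $\theta$ is the set of model parameters. The posterior of the generation model is obtained as
$$P(\mathcal{M}_m \mid x_p, x_{1 \ldots M}, \theta) = \frac{P(x_p, x_{1 \ldots M} \mid \mathcal{M}_m, \theta)\, P(\mathcal{M}_m)}{\sum_{n=1}^{M} P(x_p, x_{1 \ldots M} \mid \mathcal{M}_n, \theta)\, P(\mathcal{M}_n)},$$
where the priors $P(\mathcal{M}_1), \ldots, P(\mathcal{M}_M)$ are assumed to be uniform.
In this research, only closed-set identification is considered. Classification systems will thus not be probed by individuals who are not enrolled in the systems (impostors). Furthermore, multiple enrollment samples are available for each individual. Suppose that a model $\theta$ is employed for the classification. Given a test image $x_p$, the matching score of $x_p$ and class $i$, written $s(x_p, i \mid \theta)$, is computed from the posteriors of the generation models that bind $x_p$ to the enrollment samples of class $i$. The identity of the probe $x_p$ can then be inferred as
$$\hat{\imath} = \arg\max_{i} \, s(x_p, i \mid \theta).$$
In [19][20][21], high recognition rates have been achieved by fusing matching scores across different local areas. Inspired by this idea, this research investigates the possibility of improving performance by fusing matching scores from different classifiers. For a fusion to be successful, the fused classifiers must not be redundant. This research conjectures that PLDA, TFA, and TPLDA capture statistical properties of data that are complementary. The matching score of $x_p$ and class $i$ under the fusion of classifiers can be expressed as
$$s(x_p, i \mid \theta_1, \ldots, \theta_S) = \prod_{s=1}^{S} s(x_p, i \mid \theta_s), \tag{5}$$
where $\theta_1, \ldots, \theta_S$ are the fused classifiers. Later in the experiments, we also apply fusion across video frames $f_1, f_2, \ldots, f_P$, which can be expressed as
$$s(f_1, \ldots, f_P, i \mid \theta) = \prod_{p=1}^{P} s(f_p, i \mid \theta). \tag{6}$$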
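The scoring pipeline can be illustrated with a short sketch. It assumes that each classifier already provides one log-likelihood per generation model, that the priors are uniform (so they cancel in the posterior), and that classifier fusion uses the product rule named in the conclusion of this paper; the function names and the toy scores are hypothetical.

```python
import math
from typing import List


def posteriors_uniform_prior(log_likelihoods: List[float]) -> List[float]:
    """Turn per-model log-likelihoods into posteriors under uniform priors.

    Subtracting the maximum before exponentiating avoids numerical underflow.
    """
    peak = max(log_likelihoods)
    weights = [math.exp(v - peak) for v in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def fuse_classifiers(scores_per_classifier: List[List[float]]) -> List[float]:
    """Product rule: multiply the score of each class across classifiers."""
    n_classes = len(scores_per_classifier[0])
    return [math.prod(scores[i] for scores in scores_per_classifier)
            for i in range(n_classes)]


def identify(scores: List[float]) -> int:
    """Return the index of the class with the highest matching score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with two enrolled classes and two classifiers (made-up scores).
scores_a = [0.6, 0.4]
scores_b = [0.3, 0.7]
fused = fuse_classifiers([scores_a, scores_b])
predicted = identify(fused)
```

In this toy case the first classifier alone would pick class 0, while the fused scores (0.18 vs. 0.28) favour class 1, showing how a confident second classifier can overturn a weak decision.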

Results and Analysis
Experiments are conducted to evaluate performance of different classification models. The experiments make use of enrollment samples that consist of frontal faces only. The probe faces include half-profile and profile faces.

Datasets for Evaluation
Two datasets are collected for the experiments: the VidTIMIT+UMIST dataset [27,28] and the FERET dataset [29]. The VidTIMIT database [27] contains videos of 43 individuals who are asked to perform an extended sequence of head rotation. The rotation starts with the head facing forward, followed by facing to the right, to the left, back to forward, up, down, and finally return to forward. Three video sequences with a resolution of 512×384 are recorded from each individual in three sessions, respectively.
The UMIST database [28] contains 20 individuals, each of whom appears in various poses ranging from profile to frontal. Faces are captured as grey-scale images with a resolution of 220×220. Eighteen individuals from the UMIST database are merged with those from the VidTIMIT database to yield a total of 61 individuals. Using the merged data, three pairs of training and test sets are constructed. The training sets contain 10+24 randomly selected individuals from the UMIST and the VidTIMIT databases, respectively. The test sets contain the remaining 8+19 individuals from the UMIST and the VidTIMIT databases, respectively.
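The subject-level split can be reproduced in spirit with the sketch below. The subject identifiers, the seeds, and the function name are placeholders; the sketch only illustrates the 10+24 / 8+19 partition and the fact that training and test individuals are disjoint.

```python
import random
from typing import List, Sequence, Tuple


def split_subjects(umist_ids: Sequence[str], vidtimit_ids: Sequence[str],
                   seed: int, n_train_umist: int = 10,
                   n_train_vidtimit: int = 24) -> Tuple[List[str], List[str]]:
    """Randomly pick training subjects from each database; the rest are test."""
    rng = random.Random(seed)
    train = set(rng.sample(list(umist_ids), n_train_umist))
    train |= set(rng.sample(list(vidtimit_ids), n_train_vidtimit))
    everyone = list(umist_ids) + list(vidtimit_ids)
    test = [s for s in everyone if s not in train]
    return sorted(train), test


# Placeholder identifiers: 18 UMIST subjects and 43 VidTIMIT subjects.
umist = [f"umist_{i:02d}" for i in range(18)]
vidtimit = [f"vidtimit_{i:02d}" for i in range(43)]

# Three training/test pairs, one per (arbitrary) seed.
splits = [split_subjects(umist, vidtimit, seed) for seed in range(3)]
```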

Experiments using VidTIMIT+UMIST dataset
The training and the test data for these experiments are described in Section 3.1. Note that individuals used for testing are completely different from those used for training. The test data are divided into enrollment samples and probe data. The enrollment samples consist of "frontal" faces while the probe data consist of "half-profile" and "profile" faces. To detect faces, facial landmarks, and head poses from face images, the front-end described in Section 2 is employed. Figure 3 and Figure 4 show results of the experiments, presented as the percentage of successfully recognized images. The Eigen light-fields method is used as the baseline. When only individual classifiers are considered, TFA demonstrates the best recognition rates, i.e., 94.46 ± 0.71% and 70.95 ± 2.68% for half-profile and profile faces, respectively. TPLDA demonstrates recognition rates of 88.81 ± 2.40% and 48.10 ± 6.79% and PLDA demonstrates recognition rates of 85.67 ± 1.47% and 51.38 ± 11.87% for half-profile and profile faces, respectively. The Eigen light-fields method is the worst performer. It should be noted, however, that the superiority of TFA does not carry over to the experiments with the FERET database (Section 3.3). For this particular dataset, TFA has therefore simply captured the statistical properties of the data better than the other classifiers.
Figure 3 also shows recognition results of half-profile faces using fusion of classification models (Equation (5)). As can be seen from the figure, all fusion cases have better performance than the corresponding individual models, which supports the finding that the fused models are complementary (not redundant). The highest recognition rate is achieved by the combination of the three classification models (95.57 ± 1.36%). The second highest recognition rate is achieved by the combination of TFA and PLDA (95.25 ± 1.93%). Compared to recognition using individual classification models, peak performance increases from 94.46 ± 0.71% to 95.57 ± 1.36%.
For recognition of profile faces as shown in Figure 4, fusion significantly outperforms individual classification models only for the combination of PLDA and TPLDA. When the fusion combines TFA (the best individual classification model) with other classification models, it hardly outperforms the individual models or even degrades the performance. These results therefore highlight the second requirement for a fusion to be effective: the fused classifiers should have similar individual performance (as is the case with PLDA and TPLDA). When there is too much discrepancy between the fused classifiers, the gain produced by the fusion is not enough to compensate for the discrepancy. Figure 4 shows that the best fusion case corresponds to the combination of the three classification models. This combination reaches a peak performance of 72.13 ± 8.49%, which is better than the peak performance of the individual models (70.95 ± 2.68%).

Experiments using FERET dataset
Data for these experiments are described in Section 3.1. Classification models are constructed using the training sets, each of which contains 219 individuals. Each test set contains 100 individuals whose images are further divided into enrollment samples (frontal faces) and probe data (non-frontal faces). To extract appearance features, faces are segmented from the background using an iterative graph-cuts procedure. The segmented faces are registered to standard templates and placed against a mid-gray background. The registration is performed using a piece-wise linear warp based on 21 manually annotated facial landmarks.
Figure 5 shows recognition results of half-profile faces using frontal faces as samples. Matching to the mean of samples is used to compute matching scores since it produces better results than matching to the nearest sample. TPLDA is the best performer with a peak recognition rate of 81.50 ± 6.61%. TFA and PLDA are the second and the third best performers, with recognition rates of 73.67 ± 10.68% and 68.83 ± 2.47%, respectively. These results are thus different from those obtained from the VidTIMIT+UMIST dataset, where TFA is the best performer followed by TPLDA and PLDA. Figure 6 shows recognition results of profile faces using frontal faces as samples. Similar to the half-profile results, TPLDA, TFA, and PLDA are the best, the second best, and the third best performers, respectively. TPLDA achieves a peak recognition rate of 55.50 ± 5.20%. TFA and PLDA achieve peak recognition rates of 54.50 ± 6.93% and 50.17 ± 8.28%, respectively. These results are again different from those obtained from the VidTIMIT+UMIST dataset, where TFA, PLDA, and TPLDA are the best, the second best, and the third best performers, respectively (Section 3.2). Figure 5 also shows recognition results of half-profile faces using the fused classifiers.
As can be seen from the figure, all fusion cases give better performance than the corresponding individual models. The highest recognition rate is achieved by the combination of the three classification models (86.17 ± 3.82%). Figure 6 shows a similar situation for recognition of profile faces. All fusion cases have better peak performance than the corresponding individual models, with the combination of the three models being the best performer (63.50 ± 5.66%). These results again support the finding that the tested classification models are complementary. It should also be noted that the three individual models have similar performance, which explains why the fusion is effective. Compared to recognition using individual classification models, the fusion increases peak recognition rates from 81.50 ± 6.61% to 86.17 ± 3.82% for recognition of half-profile faces and from 55.50 ± 5.20% to 63.50 ± 5.66% for recognition of profile faces.
From experiments on the VidTIMIT+UMIST dataset as well as on the FERET dataset, it can be concluded that fusion of different classifiers effectively improves face recognition across poses. The combinations of classifiers, however, perform differently on different datasets. It appears that when the fused classifiers differ only slightly in performance, the fusion performs better than the individual classifiers. To choose the best combination of classifiers for a particular deployment, the fusion can be tested on a validation set before it is employed in the real task. Another possibility is simply to fuse all three classification models. It has been observed that fusion of the three models outperforms the three individual classifiers most of the time.
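Selecting a combination on validation data, as suggested above, can be sketched as an exhaustive search over subsets of classifiers. The sketch assumes product-rule fusion and per-probe, per-class matching scores stored in plain lists; the function names and the toy scores are made up for illustration and are not results from the paper.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# scores[name][probe][cls] = matching score of a probe against a class.
Scores = Dict[str, List[List[float]]]


def fused_accuracy(val_scores: Scores, combo: Tuple[str, ...],
                   labels: List[int]) -> float:
    """Accuracy on the validation probes when `combo` is fused by product rule."""
    correct = 0
    for probe, label in enumerate(labels):
        n_classes = len(val_scores[combo[0]][probe])
        fused = [math.prod(val_scores[name][probe][c] for name in combo)
                 for c in range(n_classes)]
        if max(range(n_classes), key=fused.__getitem__) == label:
            correct += 1
    return correct / len(labels)


def best_combination(val_scores: Scores,
                     labels: List[int]) -> Tuple[Tuple[str, ...], float]:
    """Try every non-empty subset of classifiers and keep the most accurate."""
    names = sorted(val_scores)
    best: Tuple[Tuple[str, ...], float] = ((), -1.0)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            acc = fused_accuracy(val_scores, combo, labels)
            if acc > best[1]:  # ties keep the smaller, earlier combination
                best = (combo, acc)
    return best


# Toy validation scores for two probes and two classes (made-up numbers).
toy_scores = {
    "PLDA": [[0.6, 0.4], [0.6, 0.4]],
    "TFA": [[0.45, 0.55], [0.2, 0.8]],
}
toy_labels = [0, 1]
choice = best_combination(toy_scores, toy_labels)
```

With three classifiers there are only seven subsets to evaluate, so an exhaustive search of this kind is inexpensive.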

Experiments on Videos
Previous experiments compute recognition rates by counting the number of successfully recognized images from test videos. Even though frames of training videos have been collectively used to construct classification models, recognition in these experiments is still performed based on still images. The reported recognition rates thus indicate only the probability of correct recognition, given a single image as input. To actually employ videos in the recognition, identities need to be inferred based on multiple images. In this section, such recognition is performed by fusing matching scores across video frames, which is known in the literature as decision-level fusion. Two fusion methods are considered: voting and product rule (multiplying matching scores, Equation (6)).
Table 1 and Table 2 show results of the fusion on the VidTIMIT+UMIST dataset. Probabilistic latent variable models are trained to include 42 basis vectors and matching to the nearest sample is used to compute matching scores of individual images. For recognition of half-profile faces as shown in Table 1, fusion across video frames has given a better peak performance than using a single still image (98.77 ± 2.14% vs. 95.57 ± 1.36%, Section 3.2). The best recognition rate is achieved when TFA or combinations of TFA and other classifiers are employed together with the product rule. Note that TFA seems to be dominant whenever it is combined with other classifiers. This can be seen from the performance of the combined classifiers, which is identical to the performance of TFA alone. For recognition of profile faces as shown in Table 2, fusion across video frames has also given a better peak performance than using a single still image (81.48 ± 11.11% vs. 72.13 ± 8.49%, Section 3.2). The best performance is achieved by the combination of TFA and PLDA coupled with the product rule.
Table 3 and Table 4 show results of the fusion on the FERET dataset. Compared to recognition using a single still image (Section 3.3), fusion across video frames has given better peak performance: 91.33 ± 3.21% vs. 86.17 ± 3.82% and 64.50 ± 3.54% vs. 63.50 ± 5.66% for half-profile and profile faces, respectively. The best recognition rate is achieved when the product rule is applied to matching scores obtained from the combination of the three classification models. Note that voting is not tested on this dataset since there are only two probe images for each individual. The improved performance given by fusion across multiple frames on the VidTIMIT+UMIST and the FERET datasets highlights the advantages of using video over a single still image. The multiple observations available in videos provide additional information that can be employed to resolve ambiguity in recognition.
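The difference between the two frame-level fusion methods can be seen in a small sketch. It assumes each frame yields one positive matching score per enrolled class; the function names and the toy scores are illustrative, and the product is accumulated in the log domain (with a tiny floor to avoid taking the logarithm of zero) purely for numerical stability.

```python
import math
from collections import Counter
from typing import List


def fuse_by_voting(frame_scores: List[List[float]]) -> int:
    """Each frame votes for its best class; the most voted class wins."""
    votes = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    return Counter(votes).most_common(1)[0][0]


def fuse_by_product(frame_scores: List[List[float]]) -> int:
    """Multiply each class's scores over frames (as a sum of logarithms)."""
    n_classes = len(frame_scores[0])
    log_totals = [sum(math.log(max(s[c], 1e-300)) for s in frame_scores)
                  for c in range(n_classes)]
    return max(range(n_classes), key=log_totals.__getitem__)


# Toy video of three frames and two enrolled classes (made-up scores):
# two frames weakly favour class 0, one frame strongly favours class 1.
frames = [[0.51, 0.49], [0.51, 0.49], [0.10, 0.90]]
voted = fuse_by_voting(frames)
multiplied = fuse_by_product(frames)
```

Here voting selects class 0 because two of the three frames prefer it, whereas the product rule selects class 1 because it takes the strength of each frame's evidence into account.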

Conclusion
This research evaluates the application of probabilistic latent variable models, namely PLDA, TPLDA, and TFA, as well as fusion of these classifiers, to face recognition across poses. Half-profile and profile faces are used as inputs to the recognition system, while frontal faces are used as enrollment samples. The evaluation is conducted using still images and videos from the VidTIMIT+UMIST and the FERET datasets.
Results of the experiments have shown that fusion of classifiers (at the decision level, i.e., product rule) generally produces better recognition performance than individual classifiers. This suggests that different probabilistic latent variable models capture statistical properties of data that are complementary. It is important to note, though, that fusion produces clear improvement mainly when the fused classifiers differ only slightly in performance. The optimal combination of classifiers also seems to vary from dataset to dataset. For the VidTIMIT+UMIST dataset, the peak performance increases from 94.46 ± 0.71% to 95.57 ± 1.36% and from 70.95 ± 2.68% to 72.13 ± 8.49% for recognition of half-profile faces and profile faces, respectively. For the FERET dataset, the peak performance increases from 81.50 ± 6.61% to 86.17 ± 3.82% and from 55.50 ± 5.20% to 63.50 ± 5.66% for recognition of half-profile and profile faces, respectively.
To actually employ videos for face recognition, fusion has also been applied across video frames. Product rule and voting are used as the fusion methods at the decision level.
Results of the experiments have shown that recognition using videos produces better performance than using a single still image. For the VidTIMIT+UMIST dataset, the peak performance increases from 95.57 ± 1.36% to 98.77 ± 2.14% and from 72.13 ± 8.49% to 81.48 ± 11.11% for recognition of half-profile faces and profile faces, respectively. For the FERET dataset, the peak performance increases from 86.17 ± 3.82% to 91.33 ± 3.21% and from 63.50 ± 5.66% to 64.50 ± 3.54% for recognition of half-profile faces and profile faces, respectively.