Contents
1 Introduction
1.1 Context
1.2 Applications
1.3 Difficulties
1.3.1 Illumination
1.3.2 Pose
1.3.3 Facial Expressions
1.3.4 Partial Occlusions
1.3.5 Other types of variations
1.4 Objectives
1.5 Outline
2 Machine Learning Techniques for Object Detection and Recognition
2.1 Introduction
2.2 Statistical Projection Methods
2.2.1 Principal Component Analysis
2.2.2 Linear Discriminant Analysis
2.2.3 Other Projection Methods
2.3 Active Appearance Models
2.3.1 Modeling shape and appearance
2.3.2 Matching the model
2.4 Hidden Markov Models
2.4.1 Introduction
2.4.2 Finding the most likely state sequence
2.4.3 Training
2.4.4 HMMs for Image Analysis
2.5 Adaboost
2.5.1 Introduction
2.5.2 Training
2.6 Support Vector Machines
2.6.1 Structural Risk Minimization
2.6.2 Linear Support Vector Machines
2.6.3 Nonlinear Support Vector Machines
2.6.4 Extension to multiple classes
2.7 Bag of Local Signatures
2.8 Neural Networks
2.8.1 Introduction
2.8.2 Perceptron
2.8.3 Multi-Layer Perceptron
2.8.4 Auto-Associative Neural Networks
2.8.5 Training Neural Networks
2.8.6 Radial Basis Function Networks
2.8.7 Self-Organizing Maps
2.9 Conclusion
3 Convolutional Neural Networks
3.1 Introduction
3.2 Background
3.2.1 Neocognitron
3.2.2 LeCun’s Convolutional Neural Network model
3.3 Training Convolutional Neural Networks
3.3.1 Error Backpropagation with Convolutional Neural Networks
3.3.2 Other training algorithms proposed in the literature
3.4 Extensions and variants
3.4.1 LeNet-5
3.4.2 Space Displacement Neural Networks
3.4.3 Siamese CNNs
3.4.4 Shunting Inhibitory Convolutional Neural Networks
3.4.5 Sparse Convolutional Neural Networks
3.5 Some Applications
3.6 Conclusion
4 Face detection and normalization
4.1 Introduction
4.2 Face detection
4.2.1 Introduction
4.2.2 State-of-the-art
4.2.3 Convolutional Face Finder
4.3 Illumination Normalization
4.4 Pose Estimation
4.5 Face Alignment
4.5.1 Introduction
4.5.2 State-of-the-art
4.5.3 Face Alignment with Convolutional Neural Networks
4.6 Conclusion
5 Facial Feature Detection
5.1 Introduction
5.2 State-of-the-art
5.3 Facial Feature Detection with Convolutional Neural Networks
5.3.1 Introduction
5.3.2 Architecture of the Facial Feature Detection System
5.3.3 Training the Facial Feature Detectors
5.3.4 Facial Feature Detection Procedure
5.3.5 Experimental Results
5.4 Conclusion
6 Face and Gender Recognition
6.1 Introduction
6.2 State-of-the-art in Face Recognition
6.3 Face Recognition with Convolutional Neural Networks
6.3.1 Introduction
6.3.2 Neural Network Architecture
6.3.3 Training Procedure
6.3.4 Recognizing Faces
6.3.5 Experimental Results
6.4 Gender Recognition
6.4.1 Introduction
6.4.2 State-of-the-art
6.4.3 Gender Recognition with Convolutional Neural Networks
6.5 Conclusion
7 Conclusion and Perspectives
7.1 Conclusion
7.2 Perspectives
7.2.1 Convolutional Neural Networks
7.2.2 Facial analysis with Convolutional Neural Networks
A Excerpts from the used face databases
A.1 AR
A.2 BioID
A.3 FERET
A.4 Google Images
A.5 ORL
A.6 PIE
A.7 Yale
Acknowledgments
First of all, I would like to thank Dr. Christophe Garcia for his guidance and support over the last three years. This work would not have been possible without his excellent scientific as well as human qualities and the enormous amount of time he devoted to me.
I also want to express my gratitude to my supervisor Prof. Dr. Hans Burkhardt who accompanied me during my thesis, gave me helpful advice and who always welcomed me in Freiburg.
Further, I would like to thank all my colleagues at France Telecom R&D (now Orange Labs), Rennes (France), where I spent three very pleasant years. Notably, Franck Mamalet, Sébastien Roux, Patrick Lechat, Sid-Ahmed Berrani, Zohra Saidane, Muriel Visani, Antoine Lehuger, Grégoire Lefebvre, Manolis Delakis and Christine Barbot.
Finally, I want to say thank you to my parents for their continuing support in every respect.
Abstract
In this work, we present the problem of automatic appearance-based facial analysis with machine learning techniques and describe common specific subproblems like face detection, facial feature detection and face recognition, which are crucial parts of many applications in the context of indexation, surveillance, access control or human-computer interaction.
To tackle this problem, we particularly focus on a technique called Convolutional Neural Network (CNN), which is inspired by biological evidence found in the visual cortex of mammalian brains and which has already been applied to many different classification problems. Existing CNN-based methods, like the face detection system proposed by Garcia and Delakis, show that this can be a very effective, efficient and robust approach to non-linear image processing tasks such as facial analysis.
An important step in many automatic facial analysis applications, e.g. face recognition, is face alignment which tries to translate, scale and rotate the face image such that specific facial features are roughly at predefined positions in the image. We propose an efficient approach to this problem using CNNs and experimentally show its very good performance on difficult test images.
We further present a CNN-based method for automatic facial feature detection. The proposed system employs a hierarchical procedure which first roughly localizes the eyes, the nose and the mouth and then refines the result by detecting 10 different facial feature points. The detection rate of this method is 96% for the AR database and 87% for the BioID database, tolerating an error of 10% of the inter-ocular distance.
Finally, we propose a novel face recognition approach based on a specific CNN architecture learning a non-linear mapping of the image space into a lower-dimensional subspace where the different classes are more easily separable. We applied this method to several public face databases and obtained better recognition rates than with classical face recognition approaches based on PCA or LDA. Moreover, the proposed system is particularly robust to noise and partial occlusions.
We also present a CNN-based method for the binary classification problem of gender recognition with face images and achieve a state-of-the-art accuracy. The results presented in this work show that CNNs perform very well on various facial image processing tasks, such as face alignment, facial feature detection and face recognition, and clearly demonstrate that the CNN technique is a versatile, efficient and robust approach for facial image analysis.
List of Figures
1.1 An example face under a fixed view and varying illumination . .
1.2 An example face under fixed illumination and varying pose . . .
1.3 An example face under fixed illumination and pose but varying facial expression
2.1 Active Appearance Models: annotated training example and corresponding shape-free patch
2.2 A left-right Hidden Markov Model
2.3 Two simple approaches to image analysis with 1D HMMs
2.4 Illustration of a 2D Pseudo-HMM
2.5 Graphical illustration of a linear SVM
2.6 The histogram creation procedure with the Bag-of-local-signatures approach
2.7 The Perceptron
2.8 A Multi-Layer Perceptron
2.9 Different types of activation functions
2.10 Auto-Associative Neural Networks
2.11 Typical evolution of training and validation error
2.12 The two possible cases that can occur when the minimum on the validation set is reached
2.13 A typical evolution of the error criteria on the validation set using the proposed learning algorithm
2.14 The evolution of the validation error on the NIST database using Backpropagation and the proposed algorithm
2.15 The validation error curves of the proposed approach with different initial global learning rates
2.16 The architecture of a RBF Network
2.17 A two-dimensional SOM with rectangular topology
2.18 Evolution of a two-dimensional SOM during training
3.1 The model of an S-cell used in the Neocognitron
3.2 The topology of the basic Neocognitron
3.3 Some training examples used to train the first two S-layers of Fukushima’s Neocognitron
3.4 The architecture of LeNet-1
3.5 Convolution and subsampling
3.6 Error Backpropagation with convolution maps
3.7 Error Backpropagation with subsampling maps
3.8 The architecture of LeNet-5
3.9 A Space Displacement Neural Network
3.10 Illustration of a Siamese Convolutional Neural Network
3.11 Example of positive (genuine) and negative (impostor) error functions for Siamese CNNs
3.12 The shunting inhibitory neuron model
3.13 The SICoNNet architecture
3.14 The connection scheme of the SCNN proposed by Gepperth
3.15 The sparse, shift-invariant CNN model proposed by Ranzato et al.
4.1 The architecture of the Convolutional Face Finder
4.2 Training examples for the Convolutional Face Finder
4.3 The face localization procedure of the Convolutional Face Finder
4.4 Convolutional Face Finder: ROC curves for different test sets
4.5 Some face detection results of the Convolutional Face Finder obtained with the CMU test set
4.6 The three rotation axes defined with respect to a frontal head
4.7 The face alignment process of the proposed approach
4.8 The Neural Network architecture of the proposed face alignment system
4.9 Training examples for the proposed face alignment system
4.10 The overall face alignment procedure of the proposed system
4.11 Correct alignment rate vs. allowed mean corner distance of the proposed approach
4.12 Precision of the proposed alignment approach and the approach based on facial feature detection
4.13 Sensitivity analysis of the proposed alignment approach: Gaussian noise
4.14 Sensitivity analysis of the proposed alignment approach: partial occlusion
4.15 Some face alignment results of the proposed approach on the Internet test set
5.1 Principal stages of the feature detection process of the proposed approach
5.2 Some input images and corresponding desired output feature maps
5.3 Architecture of the proposed facial feature detector
5.4 Eye feature detector: example of an input image with desired facial feature points, desired output maps and superposed desired output maps
5.5 Mouth feature detector: example of an input image with desired facial feature points, desired output maps and superposed desired output maps
5.6 Facial feature detector: virtual face images created by applying various geometric transformations
5.7 Facial feature detector: detection rate versus me of the four features
5.8 Facial feature detector: detection rate versus mei of each facial feature (FERET)
5.9 Facial feature detector: detection rate versus mei of each facial feature (Google images)
5.10 Facial feature detector: detection rate versus mei of each facial feature (PIE subset)
5.11 The different types of CNN input features that have been tested
5.12 ROC curves comparing the CNNs trained with different input features (FERET database)
5.13 ROC curves comparing the CNNs trained with different input features (Google images)
5.14 ROC curves comparing the CNNs trained with different input features (PIE subset)
5.15 Sensitivity analysis of the proposed facial feature detector: Gaussian noise
5.16 Sensitivity analysis of the proposed facial feature detector: partial occlusion
5.17 Facial feature detection results on different face databases
5.18 Overall detection rate of the proposed facial feature detection method for AR
5.19 Overall detection rate of the proposed facial feature detection method for BioID
5.20 Some results of combined face and facial feature detection with the proposed approach
6.1 The basic schema of our face recognition approach showing two different individuals
6.2 Architecture of the proposed Neural Network for face recognition
6.3 ROC curves of the proposed face recognition algorithm for the ORL and Yale databases
6.4 Examples of image reconstruction of the proposed face recognition approach
6.5 Comparison of the proposed approach with the Eigenfaces and Fisherfaces approach: ORL database
6.6 Comparison of the proposed approach with the Eigenfaces and Fisherfaces approach: Yale database
6.7 Sensitivity analysis of the proposed face recognition approach: Gaussian noise
6.8 Sensitivity analysis of the proposed face recognition approach: partial occlusion
6.9 Examples of training images for gender classification
6.10 ROC curve of the gender recognition CNN applied to the unmixed FERET test set
List of Tables
2.1 Comparison of the proposed learning algorithm with Backpropagation and the bold driver method (10 hidden neurons)
2.2 Comparison of the proposed learning algorithm with Backpropagation and the bold driver method (40 hidden neurons)
3.1 The connection scheme of layer C3 of LeNet-5
4.1 Detection rate vs. false alarm rate of selected face detection methods on the CMU test set
4.2 The connection scheme of layer C2 of the Convolutional Face Finder
4.3 Comparison of face detection results evaluated on the CMU and MIT test sets
4.4 Execution speed of the CFF on different platforms
5.1 Overview of detection rates of some published facial feature detection methods
5.2 Comparison of eye pupil detection rates of some published methods on the BioID database
6.1 Recognition rates of the proposed approach compared to Eigenfaces and Fisherfaces
List of Algorithms
1 The Viterbi algorithm
2 The Adaboost algorithm
3 The standard online Backpropagation algorithm for MLPs
4 The proposed online Backpropagation algorithm with adaptive learning rate
5 The RPROP algorithm
6 The line search algorithm
7 A training algorithm for SelfOrganizing Maps
8 The online Backpropagation algorithm for Convolutional Neural Networks
Chapter 1
Introduction
1.1 Context
The automatic processing of images to extract semantic content is a task that has gained considerable importance in recent years due to the constantly increasing number of digital photographs on the Internet or stored on personal home computers. The need to organize them automatically in an intelligent way using indexing and image retrieval techniques requires effective and efficient image analysis and pattern recognition algorithms that are capable of extracting relevant semantic information.
Faces, in particular, contain a great deal of valuable information compared to other objects or visual items in images. For example, recognizing a person in a photograph generally tells a lot about the overall content of the picture.
In the context of human-computer interaction (HCI), it might also be important to detect the position of specific facial characteristics or to recognize facial expressions, in order to allow, for example, a more intuitive communication between the device and the user or to efficiently encode and transmit facial images coming from a camera. Thus, the automatic analysis of face images is crucial for many applications involving visual content retrieval or extraction.
The principal aim of facial analysis is to extract valuable information from face images, such as the position of the face in the image, facial characteristics, facial expressions, or the person’s gender or identity.
We will outline the most important existing approaches to facial image analysis and present novel methods based on Convolutional Neural Networks (CNNs) to detect, normalize and recognize faces and facial features. CNNs have proven to be a powerful and flexible feature extraction and classification technique which has been successfully applied in other contexts, e.g. handwritten character recognition, and which, as we will show experimentally in this work, is very appropriate for facial analysis problems.
We will focus on the processing of two-dimensional gray-level images as this is the most widespread form of digital images and thus allows the proposed approaches to be applied in the most extensive and generic way. However, many techniques described in this work could also be extended to color images, 3D data or multimodal data.
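The two core CNN operations this work builds on, convolution with shared weights followed by subsampling, can be sketched for a gray-level image as follows. This is a minimal NumPy illustration of the principle, not one of the architectures used in this thesis:

```python
import numpy as np

def convolve_valid(image, kernel):
    """'Valid' 2D convolution: each output unit sees a local receptive
    field of the input, with the same shared weights everywhere."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(fmap, factor=2):
    """Subsampling (average pooling): reduces resolution, giving a degree
    of invariance to small translations of the input."""
    h = (fmap.shape[0] // factor) * factor
    w = (fmap.shape[1] // factor) * factor
    return fmap[:h, :w].reshape(h // factor, factor,
                                w // factor, factor).mean(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 gray-level image
kernel = np.full((3, 3), 1.0 / 9.0)               # averaging filter as an example
features = subsample(np.tanh(convolve_valid(image, kernel)))
print(features.shape)  # (2, 2)
```

In a real CNN, several such convolution/subsampling layers with trained kernels are stacked, and the `tanh` squashing above stands in for the layer's non-linear activation.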
1.2 Applications
There are numerous possible applications for facial image processing algorithms. The most important of them concern face recognition. In this regard, one has to differentiate between closed world and open world settings. In a closed world application, the algorithm is dedicated to a limited group of persons, e.g. to recognize the members of a family. In an open world context the algorithm should be able to deal with images from “unknown” persons, i.e. persons that have not been presented to the system during its design or training. For example, an application indexing large image databases like Google images or television programs should recognize learned persons and respond with “unknown” if the person is not in the database of registered persons.
Concerning face recognition, there further exist two types of problems: face identification and face verification (or authentication). The first problem, face identification, is to determine the identity of a person on an image. The second one only deals with the question: “Is ‘X’ the identity of the person shown on the image?” or “Is the person shown on the image the one he claims to be?”. These questions only require “yes” or “no” as the answer.
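The distinction between identification and verification, together with the open-world "unknown" response mentioned above, can be sketched with feature vectors compared by distance. The gallery, the signature values and the threshold below are purely illustrative, not part of any system proposed in this thesis:

```python
import numpy as np

# Hypothetical gallery of enrolled face "signatures" (e.g. feature vectors
# extracted from face images); names, values and threshold are made up.
gallery = {
    "alice": np.array([0.9, 0.1, 0.2]),
    "bob":   np.array([0.2, 0.8, 0.5]),
}
THRESHOLD = 0.5  # maximum distance accepted as a match

def identify(probe):
    """Open-world identification: return the closest enrolled identity,
    or 'unknown' if no gallery entry is close enough."""
    name, dist = min(((n, np.linalg.norm(probe - s))
                      for n, s in gallery.items()),
                     key=lambda t: t[1])
    return name if dist <= THRESHOLD else "unknown"

def verify(probe, claimed_identity):
    """Verification: a yes/no answer to 'is the probe the claimed person?'."""
    return np.linalg.norm(probe - gallery[claimed_identity]) <= THRESHOLD

probe = np.array([0.85, 0.15, 0.25])
print(identify(probe))         # closest enrolled identity, here "alice"
print(verify(probe, "alice"))  # True
```

A closed-world system would simply drop the threshold test in `identify` and always return the nearest enrolled identity.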
Possible applications for face authentication are mainly concerned with access control, e.g. restricting the physical access to a building, such as a corporate building, a secured zone of an airport, a house etc. Instead of opening a door with a key or a code, the respective person would communicate an identifier, e.g. his/her name, and present his/her face to a camera. The face authentication system would then verify the identity of the person and grant or refuse access accordingly. This principle could equally be applied to the access to computer systems, automatic teller machines, mobile phones, Internet sites etc., where one would present one’s face to a camera instead of entering an identification number or password.
Clearly, face identification can also be used for controlling access. In this case the person only has to present his/her face to the camera without claiming his/her identity. A system recognizing the identity of a person can further be employed to control more specifically the rights of the respective persons stored in its database. For instance, parents could allow their children to watch only certain television programs or web sites, while the television or computer would automatically recognize the persons in front of it.
Video surveillance is another application of face identification. The aim here is to recognize suspects or criminals using video cameras installed at public places, such as banks or airports, in order to increase the overall security of these places. In this context, the database of suspects to recognize is often very large and the images captured by the camera are of low quality, which makes the task rather difficult.
With the vast propagation of digital cameras in recent years, the number of digital images stored on servers and personal home computers is rapidly growing. Consequently, there is an increasing need for indexation systems that automatically categorize and annotate this huge amount of images in order to allow effective searching and so-called content-based image retrieval. Here, face detection and recognition methods play a crucial role because a great part of photographs actually contain faces. A similar application is the temporal segmentation and indexation of video sequences, such as TV programs, where different scenes are often characterized by different faces.
Figure 1.1: An example face under a fixed view and varying illumination
Another field of application is facial image compression, i.e. parts of images containing faces can be coded by a specialized algorithm that incorporates a generic face model and thus leads to very high compression rates compared to universal techniques.
Finally, there are many possible applications in the field of advanced Human-Computer Interaction (HCI), e.g. the control and animation of avatars, i.e. computer-synthesized characters. Such systems capture the position and movement of the face and facial features and animate a virtual avatar accordingly, which can be seen by the interlocutor. Other examples are the facilitation of the interaction of disabled persons with computers or other machines, or the automatic recognition of facial expressions in order to detect the reaction of the person(s) sitting in front of a camera (e.g. smiling, laughing, yawning, sleeping).
1.3 Difficulties
There are some inherent properties of faces, as well as of the way face images are captured, which make the automatic processing of face images a rather difficult task. In the case of face recognition, this leads to the problem that the intra-class variance, i.e. the variation of the face of the same person due to lighting, pose etc., is often higher than the inter-class variance, i.e. the variation of facial appearance between different persons, which reduces the recognition rate. In many face analysis applications, the appearance variation resulting from these circumstances can also be considered as noise, as it makes the desired information, e.g. the identity of the person, harder to extract and reduces the overall performance of the respective systems.
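The intra-class versus inter-class contrast can be made concrete by computing within-class and between-class scatter, the same quantities that Linear Discriminant Analysis (section 2.2.2) trades off. The toy feature vectors below are illustrative only:

```python
import numpy as np

# Toy feature vectors for two "persons" (classes); in practice these would
# be features extracted from face images. Values are illustrative only.
classes = [
    np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]),  # person A, varying conditions
    np.array([[1.1, 2.1], [0.9, 1.9], [1.3, 2.3]]),  # person B, varying conditions
]

overall_mean = np.mean(np.vstack(classes), axis=0)

# Within-class (intra-class) scatter: spread of each person's samples
# around that person's own mean.
intra = sum(np.sum((c - c.mean(axis=0)) ** 2) for c in classes)

# Between-class (inter-class) scatter: spread of the class means around
# the overall mean, weighted by class size.
inter = sum(len(c) * np.sum((c.mean(axis=0) - overall_mean) ** 2)
            for c in classes)

print(intra > inter)  # True here: the two classes overlap heavily
```

When, as in this example, the intra-class scatter dominates, a nearest-neighbour classifier in the raw feature space will frequently confuse the two persons, which is exactly the difficulty described above.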
In the following, we will outline the most important difficulties encountered in common realworld applications.
1.3.1 Illumination
Changes in illumination can entail considerable variations in the appearance of faces and thus of face images. Two main types of light sources influence the overall illumination: ambient light and point light (or directed light). The former is somewhat easier to handle because it only affects the overall brightness of the resulting image. The latter, however, is far more difficult to analyze, as face images taken under varying light source directions follow a highly non-linear function. Additionally, the face can cast shadows on itself. Figure 1.1 illustrates the impact of varying illumination on face images.
Figure 1.2: An example face under fixed illumination and varying pose
Figure 1.3: An example face under fixed illumination and pose but varying facial expression
Many approaches have been proposed to deal with this problem. Some face detection or recognition methods try to be invariant to illumination changes by implicitly modeling them or extracting invariant features. Others propose a separate processing step, a kind of normalization, in order to reduce the effect of illumination changes. In section 4.3 some of these illumination normalization methods will be outlined.
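As a minimal example of such a normalization step, a global brightness or contrast change, as caused by ambient light, can be removed by mapping every image to zero mean and unit variance. This simple sketch is only an illustration of the idea, not one of the methods discussed in section 4.3:

```python
import numpy as np

def normalize_illumination(image):
    """Global illumination normalization: ambient-light changes mainly
    shift/scale the overall brightness, so mapping every image to zero
    mean and unit variance removes that component. (Directed light acts
    non-linearly and needs more elaborate methods.)"""
    image = image.astype(np.float64)
    std = image.std()
    return (image - image.mean()) / std if std > 0 else image - image.mean()

# Two versions of the "same" face patch, one globally darker:
# after normalization they become identical.
patch = np.array([[100., 120.], [140., 160.]])
darker = 0.5 * patch + 10.0  # global ambient brightness/contrast change
a = normalize_illumination(patch)
b = normalize_illumination(darker)
print(np.allclose(a, b))  # True
```

Any affine brightness change `s * image + o` is cancelled by this mapping, which is why it handles ambient light but not shadows or directed light.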
1.3.2 Pose
The variation of head pose or, in other words, of the viewing angle from which the image of the face was taken is another difficulty and significantly impacts the performance of automatic face analysis methods. For this reason, many applications limit themselves to more or less frontal face images or otherwise perform a pose-specific processing that requires a preceding estimation of the pose, as in multi-view face recognition approaches. Section 4.4 outlines some 2D pose estimation approaches that have been presented in the literature.
If the rotation of the head takes place within the image plane, the pose can be normalized by estimating the rotation angle and rotating the image such that the face is in an upright position. This type of normalization is part of a procedure called face alignment or face registration and is described in more detail in section 4.5.
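The in-plane part of such an alignment can be written as a similarity transform (rotation, uniform scale and translation) computed from two detected feature points, e.g. the eye centres. The target positions and coordinates below are illustrative assumptions, not the values actually used in this thesis:

```python
import numpy as np

def alignment_transform(left_eye, right_eye,
                        target_left=(30.0, 30.0), target_right=(70.0, 30.0)):
    """Compute the 2x3 similarity transform (rotation, scale, translation)
    that moves the detected eye centres to fixed target positions.
    Target coordinates are hypothetical example values."""
    src = np.asarray(right_eye, float) - np.asarray(left_eye, float)
    dst = np.asarray(target_right, float) - np.asarray(target_left, float)
    # Complex-number trick: the similarity s*e^{i*theta} mapping src to dst.
    z = complex(*dst) / complex(*src)
    a, b = z.real, z.imag
    R = np.array([[a, -b], [b, a]])    # combined scale * rotation matrix
    t = np.asarray(target_left, float) - R @ np.asarray(left_eye, float)
    return np.hstack([R, t[:, None]])  # 2x3 affine matrix

def apply(M, p):
    """Apply the 2x3 transform to a 2D point."""
    return M[:, :2] @ np.asarray(p, float) + M[:, 2]

M = alignment_transform(left_eye=(40.0, 52.0), right_eye=(80.0, 48.0))
print(apply(M, (40.0, 52.0)))  # approx. [30. 30.]
print(apply(M, (80.0, 48.0)))  # approx. [70. 30.]
```

Warping the whole image with this matrix (e.g. with any standard affine-warp routine) then produces a face whose eyes sit at the predefined positions.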
illustration not visible in this excerpt
Figure 1.2 shows some example face images with varying head pose.
1.3.3 Facial Expressions
The appearance of a face with different facial expressions varies considerably (see Fig. 1.3). Depending on the application, this can be of more or less importance. For example, for access control systems the subjects are often required to show a neutral expression, so invariance to facial expression might not be an issue in this case. By contrast, in an image or video indexation system, for example, this becomes more important, as the persons are shown in everyday situations and might speak, smile, laugh etc.
In general, the mouth is subject to the largest variation. The person in an image can have an open or closed mouth and can be speaking, smiling, laughing or even grimacing.
Eyes and eyebrows also change with varying facial expressions, e.g. when the person blinks, sleeps or opens his/her eyes wide.
1.3.4 Partial Occlusions
Partial occlusions occur quite frequently in real-world face images. They can be caused by a hand occluding a part of the face, e.g. the mouth, by long hair, glasses, sunglasses, or by other objects or persons.
In most cases, however, the face occludes parts of itself. For example, in a view from the side, the far side of the face is hidden. Also, a part of the cheek can be occluded by the nose, or an eye can be hidden by its socket, for example.
1.3.5 Other types of variations
Appearance variations are also caused by varying make-up, varying haircut and the presence of facial hair (beard, mustache etc.).
Varying age is also an important factor influencing the performance of many face analysis methods. This is the case, for example, in face recognition when the reference face image has been taken some years before the image to be recognized.
Finally, there are also variations across the subjects’ identities, such as race, skin color or, more generally, ethnic origin. The respective differences in the appearance of the face images can cause difficulties in applications like face or facial feature detection or gender recognition.
1.4 Objectives
The goals pursued in this work principally concern the evaluation of Convolutional Neural Networks (CNN) in the context of facial analysis applications. More specifically, we will focus on the following objectives:
- evaluate the performance of CNNs w.r.t. appearance-based facial analysis
- investigate the robustness of CNNs against classical sources of noise in the context of facial analysis
- propose different CNN architectures designed for specific facial analysis problems such as face alignment, facial feature detection, face recognition and gender classification
- improve upon the state-of-the-art in appearance-based facial feature detection, face alignment as well as face recognition under real-world conditions
- investigate different solutions improving the performance of automatic face recognition systems
1.5 Outline
In the following chapter we will outline some of the most important machine learning techniques used for object detection and recognition in images, such as statistical projection methods, Hidden Markov Models, Support Vector Machines and Neural Networks.
In chapter 3, we will then focus on one particular approach, called Convolutional Neural Networks (CNN), which is the foundation for the methods proposed in this work.
Having described, among other aspects, the principal architecture and training methods for CNNs, in chapter 4 we will outline the problem of face detection and normalization and how CNNs can tackle these types of problems. Using an existing CNN-based face detection system, called Convolutional Face Finder (CFF), we will further present an effective approach for face alignment, which is an important step in many facial analysis applications.
In chapter 5, we will describe the problem of facial feature detection, which proves to be crucial for many facial image processing tasks. We will propose an approach based on a specific type of CNN to solve this problem and experimentally show its performance in terms of precision and robustness to noise.
Chapter 6 outlines two further facial analysis problems, namely automatic face recognition and gender recognition. We will also present CNN-based approaches to these problems and experimentally show their effectiveness compared to other machine learning techniques proposed in the literature.
Finally, chapter 7 will conclude this work with a short summary and some perspectives for future research.
Chapter 2 Machine Learning Techniques for Object Detection and Recognition
2.1 Introduction
In this chapter we will outline some of the most common machine learning approaches to object detection and recognition. Machine learning techniques automatically learn from a set of examples how to classify new instances of the same type of data. The capacity to generalize, i.e. the ability to successfully classify unknown data and possibly infer generic rules or functions, is an important property of these approaches and is sought to be maximized.
Usually, one distinguishes between three types of learning:
Supervised learning A training set and the corresponding desired outputs of the function to learn are available. Thus, during training the algorithm iteratively presents examples to the system and adapts its parameters according to the distance between the produced and the desired outputs.
Unsupervised learning The underlying structure of the training data, i.e. the desired output, is unknown and is to be determined by the training algorithm. For example, for a classification method this means that the class information is not available and has to be approximated by grouping the training examples using some distance measure, a technique called clustering.
Reinforcement learning Here, the exact output of the function to learn is unknown, and training consists in a parameter adjustment based on only two concepts, reward and penalty. That is, if the system does not perform well (enough), it is “penalized” and the parameters are adapted accordingly. Otherwise, it is “rewarded”, i.e. some positive reinforcement takes place.
Most of the algorithms described in the following are supervised, but they are employed for rather different purposes: some of them are used to extract features from the input data, some are used to classify the extracted features, and others perform both tasks.
The application context varies also largely, i.e. some of the approaches can be used for detection of features and/or objects, some only for recognition and others for both. Further, in many systems a combination of several of the techniques described in this chapter is used. Thus, in a sense, they could be considered as some kind of building blocks for effective object detection and recognition systems.
Let us begin with some of the most universal techniques used in machine learning, which are based on a statistical analysis of the data and allow a significant reduction of its dimensionality as well as the extraction of valuable information.
2.2 Statistical Projection Methods
In order to be able to automatically analyze images, they are often resized to have a certain width w and height h. Then, the respective image rows or columns of each image are concatenated to build a vector of dimension n = w × h. The resulting vector space is called image space, denoted I in the following.
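As a small illustration of this vectorization step, a numpy sketch with toy values standing in for a real resized image:

```python
import numpy as np

# Toy stand-in for a grayscale face image of width w and height h
# (a real application would load and resize an actual image).
w, h = 4, 3
image = np.arange(h * w, dtype=float).reshape(h, w)  # h x w pixel matrix

# Concatenate the rows to build a vector of dimension n = w * h,
# i.e. a single point in the image space I.
x = image.reshape(-1)          # row-wise concatenation
n = x.shape[0]
print(n)                       # 12, i.e. w * h
```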
In signal processing tasks there is often a lot of redundancy in the respective images/vectors because, firstly, images of the same class of objects are likely to be similar and, secondly, neighboring pixels in an image are highly correlated.
Thus, it seems obvious to represent the images in a more compact form, i.e. to project the vectors into a subspace S of I by means of a statistical projection method. In the literature, the terms dimensionality reduction or feature selection are often employed in the context of these techniques. These methods aim at computing S which, in general, is of lower dimension than I, such that the transformed image vectors are statistically less correlated. There are two main groups of projections: linear and nonlinear projections.
Linear projection techniques transform an image vector x = (x1, . . . , xn)T of dimension n into a vector s = (s1, . . . , sk)T of dimension k by a linear k × n transformation matrix W:

s = W x .
In general, one eliminates those basis vectors that are supposed to contain the least important information for a given application, using a predefined criterion. Thus, the dimension k of the resulting subspace S can be chosen after calculating the basis vectors spanning the entire subspace.
The most common and fundamental projection methods are the Principal Component Analysis (PCA) and the Linear Discriminant Analysis (LDA) which will be described in the following sections.
Nonlinear approaches are applied when a linear projection does not suffice to represent the data in a way that allows the extraction of discriminant features. This is the case for more complex distributions where mere hyperplanes fail to separate the classes to distinguish. As most of these approaches are iterative, they require an a priori choice of the dimension k of the resulting subspace S.
2.2.1 Principal Component Analysis
Principal Component Analysis (PCA), also known as the discrete Karhunen-Loève Transform (KLT) or Hotelling Transform as it is due to Hotelling [101], is a linear orthogonal projection into the subspace where the first dimension (or axis) corresponds to the direction of I having the greatest variance, the second dimension to the direction with the second greatest variance and so on.
Thus, the resulting orthogonal subspace S, called principal subspace, best describes the distribution of the input space I. It finds the directions of greatest variance, which are supposed to reflect the most “important” aspects of the data.
Given a certain number N of input vectors x1, . . . , xN ∈ I that are assumed to have a multinormal distribution and to be centered, i.e. 1/N ∑i=1..N xi = 0, the projection computes

si = W xi ,
where si ∈ Rk . Now let Σ be the covariance matrix of the input vectors
Σ = 1/N ∑i=1..N xi xiT .
Hence, the covariance matrix of the projected vectors si is defined as
Σs = W Σ WT .
Finally, the projection matrix W is supposed to maximize the variance of the projected vectors. Thus,
W = arg maxW tr(W Σ WT) ,  subject to W WT = I .
The k columns of W , i.e. the basis vectors of S, are called the principal components and represent the eigenvectors corresponding to the largest eigenvalues of the covariance matrix Σ.
An important characteristic of PCA is that if k < n the reconstruction error e in terms of the Euclidean distance is minimal,
e = ∑i=1..N ||xi − WT W xi||² .
Thus, the first k eigenvectors form a subspace that optimally encodes or represents the input space I. This fact is exploited for example in compression algorithms and template matching techniques.
The choice of k depends largely upon the actual application. Additionally, for some applications it might not even be optimal to select the eigenvectors corresponding to the largest eigenvalues.
Kirby et al. [122] introduced a classical selection criterion which they call energy dimension. Let λj be the eigenvalue associated with the jth eigenvector. Then, the energy dimension of the ith eigenvector is:
Ei = ( ∑j=i+1..n λj ) / ( ∑j=1..n λj ) .
One can show that the Mean Squared Error (MSE) produced by the last n − i rejected eigenvectors is ∑j=i+1..n λj. The selection of k now consists in determining a threshold τ such that Ek−1 > τ and Ek < τ.
Apart from image compression and template matching, PCA is often applied to classification tasks, e.g. the Eigenfaces approach [243] in face recognition. Here, the projected vectors si are the signatures to be classified. To this end, the signatures of the N input images are each associated with a class label and used to build a classifier. The simplest classifier would be a nearest neighbor classifier using a Euclidean distance measure.
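As an illustrative sketch (with synthetic random vectors standing in for real face images), PCA and such a nearest neighbor classifier can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for vectorized face images (n = 20 dimensions):
# two classes with different means, N = 40 training vectors.
N, n = 40, 20
labels = np.repeat([0, 1], N // 2)
X = rng.normal(size=(N, n)) + 3.0 * labels[:, None]
mean = X.mean(axis=0)
Xc = X - mean                               # center the data

# Covariance matrix and its eigendecomposition.
Sigma = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the k leading principal components; W is the k x n projection matrix.
k = 5
W = eigvecs[:, :k].T
signatures = Xc @ W.T                       # projected vectors s_i = W x_i

# Eigenfaces-style classification: nearest neighbor in the subspace.
def classify(x_new):
    s = W @ (x_new - mean)
    d = np.linalg.norm(signatures - s, axis=1)
    return labels[np.argmin(d)]

probe = rng.normal(size=n) + 3.0            # sample resembling class 1
```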
To sum up, PCA calculates the linear orthogonal subspace having its axes oriented with the directions of greatest variances. It thus optimally represents the input data. However, in a classification context it is not guaranteed that in the subspace calculated by PCA the separability of the data is improved. In this regard, the Linear Discriminant Analysis (LDA) described in the following section is more suitable.
2.2.2 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) was introduced by Fisher [69] in 1936 but generalized later on to the so-called Fisher's Linear Discriminant (FLD). In contrast to PCA, it is not only concerned with the best representation of the data but also with its separability in the projected subspace with regard to the different classes.
Let Ω = {x1, . . . , xN} be the training set partitioned into c annotated classes denoted Ωi (i ∈ 1..c). We are now searching for the subspace S that maximizes the inter-class variability while minimizing the intra-class variability, thus improving the separability of the respective classes. To this end, one maximizes the so-called Fisher's criterion [69, 16]:
J(W) = |WT Σb W| / |WT Σw W| ,

where Σw = ∑j=1..c ∑xi∈Ωj (xi − x̄j)(xi − x̄j)T denotes the within-class variance and Σb = ∑j=1..c Nj x̄j x̄jT the between-class variance. Nj is the number of examples in Ωj (i.e. of class j), and x̄j = 1/Nj ∑xi∈Ωj xi are the respective means. The overall mean of the data, which is assumed to be centered, is x̄ = 0.
The projection matrix W is obtained by calculating the eigenvectors associated with the largest eigenvalues of the matrix Σw⁻¹ Σb. These eigenvectors form the columns of W.
A problem occurs when the number of examples N is smaller than the size of the input vectors, i.e. for images the number of pixels n. Then, Σw is singular since its rank is at most N − c, and the calculation of Σw⁻¹ is thus impossible. Several approaches have been proposed to overcome this problem. One is to produce additional examples by adding noise to the images of the training database. Another approach consists in first applying PCA to reduce the input vector space to the dimension N − c and then performing LDA as described above.
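A minimal numpy sketch of this procedure on synthetic data (here N > n, so Σw is invertible and no PCA pre-reduction is needed):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training set: c = 3 classes in n = 10 dimensions.
n, per_class = 10, 20
means = np.array([np.zeros(n),
                  np.r_[4.0, np.zeros(n - 1)],
                  np.r_[0.0, 4.0, np.zeros(n - 2)]])
X = np.vstack([rng.normal(size=(per_class, n)) + m for m in means])
y = np.repeat([0, 1, 2], per_class)
X = X - X.mean(axis=0)                      # center the overall data

# Within-class and between-class scatter matrices.
Sw = np.zeros((n, n))
Sb = np.zeros((n, n))
for j in range(3):
    Xj = X[y == j]
    mj = Xj.mean(axis=0)
    Sw += (Xj - mj).T @ (Xj - mj)
    Sb += len(Xj) * np.outer(mj, mj)        # overall mean is 0

# The columns of W are the leading eigenvectors of Sw^{-1} Sb;
# at most c - 1 = 2 of them carry discriminative information.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:2]].T            # 2 x n projection matrix
S = X @ W.T                                 # projected signatures
```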
2.2.3 Other Projection Methods
There are many other projection techniques proposed in the literature that can be applied to object detection and recognition. For example, Independent Component Analysis (ICA) [17, 2, 35, 109, 108] is a technique often used for blind source separation [120], i.e. to find the different independent sources a given signal is composed of. ICA seeks a linear subspace where the data is not only uncorrelated but statistically independent. In its most simple form, the model is the following:
x = A s ,
where x is the observed data, s are the independent sources and A is the so-called mixing matrix. ICA consists in optimizing an objective function, called the contrast function, that can be based on different criteria. The contrast function has to ensure that the projected data is independent and non-Gaussian. Note that ICA does not reduce the dimensionality of the input data. Hence, it is often employed in combination with PCA or another dimensionality reduction technique. Numerous implementations of ICA exist, e.g. INFOMAX [17], JADE [35] or FastICA [109].
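As an illustration, a compact FastICA-style sketch in numpy (a deflation scheme with a tanh nonlinearity on toy sources; this is a simplified sketch, not one of the cited implementations):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent toy sources: a sine wave and a square wave.
t = np.linspace(0, 8, 2000)
S = np.vstack([np.sin(2 * np.pi * t),
               np.sign(np.sin(3 * np.pi * t))])      # 2 x T source matrix
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                           # mixing matrix
X = A @ S                                            # observed data x = A s

# Preprocessing: centering and whitening (often done with PCA).
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# FastICA with deflation: a fixed-point iteration maximizing the
# non-Gaussianity of each projection w^T z, using g(u) = tanh(u).
W = np.zeros((2, 2))
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        wz = np.tanh(w @ Z)
        w_new = (Z * wz).mean(axis=1) - (1 - wz ** 2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)           # decorrelate (deflation)
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1) < 1e-10
        w = w_new
        if converged:
            break
    W[i] = w

S_est = W @ Z        # recovered sources (up to permutation, sign and scale)
```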
Yang et al. [263] introduced the so-called two-dimensional PCA, which does not require the input image to be transformed into a one-dimensional vector beforehand. Instead, a generalized covariance matrix is directly estimated using the image matrices. Then, the eigenvectors are determined in a similar manner as for 1D PCA by minimizing a special criterion based on this covariance matrix. Finally, in order to perform classification, a distance measure between matrix signatures has to be defined. It has been shown that this method outperforms one-dimensional PCA in terms of classification rate [263] and robustness [254].
Visani et al. [252] presented a similar approach based on LDA: the two-dimensional oriented LDA. The procedure is analogous to the 2D PCA method, where the projection is directly performed on the image matrices, either column-wise or row-wise. A generalized Fisher's criterion is defined and minimized in order to obtain the projection matrix. Further, the authors showed that, in contrast to LDA, the two-dimensional oriented LDA can implicitly circumvent the singularity problem. In a later work [253], they generalized this approach to the Bilinear Discriminant Analysis (BDA), where column-wise and row-wise 2D LDA is iteratively applied to estimate the pair of projection matrices minimizing an expression similar to Fisher's criterion which combines the two projections.
Note that the projection methods presented so far are all linear projection techniques. However, in some cases the different classes cannot be correctly separated in a linear subspace. Then, nonlinear projection methods can help to improve the classification rate. Most of the linear projection methods can be made nonlinear by projecting the input data into a higher-dimensional space where the classes are more likely to be linearly separable. That means, the separating hyperplane in this space represents a nonlinear subspace of the input vector space. Fortunately, it is not necessary to explicitly describe this higher-dimensional space and the respective projection function if we find a so-called kernel function that implements a simple dot-product in this vector space and satisfies Mercer's condition (see Theorem 1 on p. 22). For a more formal explanation see section 2.6.3 on nonlinear SVMs. The kernel function allows performing a dot-product in the target vector space and can be used to construct nonlinear versions of the previously described projection techniques, e.g. PCA [219, 264], LDA [161] or ICA [6].
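As a sketch of this kernel idea, a minimal kernel PCA with an RBF kernel on toy data (the kernel bandwidth and the data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two concentric circles: the classes cannot be separated by a hyperplane
# in the 2D input space.
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]

# RBF kernel k(x, y) = exp(-||x - y||^2 / 2): a dot product in an implicit
# high-dimensional feature space, satisfying Mercer's condition.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 2.0)

# Kernel PCA: center the kernel matrix (i.e. center the data in feature
# space) and take the leading eigenvectors as nonlinear components.
N = len(X)
J = np.eye(N) - np.ones((N, N)) / N
Kc = J @ K @ J
eigvals, eigvecs = np.linalg.eigh(Kc)                # ascending order
alpha = eigvecs[:, ::-1][:, :2]                      # top two components
projections = Kc @ alpha                             # nonlinear "signatures"
```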
The projection approaches that have been outlined in this section can in principle be applied to any type of data in order to perform a statistical analysis on the respective examples. A technique called Active Appearance Model (AAM) [41] can also be classified as a statistical projection approach, but it is much more specialized to model images of deformable objects under varying external conditions. Thus, in contrast to methods like PCA or LDA, where the input image is treated as a “static” vector, small local deformations are taken into account. AAMs have been especially applied to face analysis, and we will therefore describe this technique in more detail in the following section.
2.3 Active Appearance Models
Active Appearance Models (AAM), introduced by Cootes et al. [41] as an extension to Active Shape Models (ASM) [43], represent an approach that statistically describes not only the texture of an object but also its shape. Given a new image of the class of objects to analyze, the idea is to interpret the object by synthesizing an image of the respective object while approximating its appearance in the real image as well as possible. The approach has mainly been applied to face analysis problems [60, 41]. We will therefore use face images in the following to illustrate this technique. Modeling the shape of faces appears to be helpful in most face analysis applications where the face images are subject to changes in pose and facial expressions.
2.3.1 Modeling shape and appearance
The basis of the algorithm is a set of training images with a certain number of annotated feature points, so-called landmark points, i.e. two-dimensional vectors. Each set of landmarks is represented as a single vector x, and PCA is applied to the whole set of vectors. Thus, any shape example can be approximated by the equation:
x = x̄ + Ps bs ,    (2.15)
where x̄ is the mean shape, and Ps is the linear subspace representing the possible variations of shape, parameterized by the vector bs. Then, the annotated control points of each training example are matched to the mean shape while warping the pixel intensities using a triangulation algorithm. This leads to a so-called shape-free face patch for each example.
illustration not visible in this excerpt
Figure 2.1: Active Appearance Models: annotated training example and corresponding shape-free patch
Figure 2.1 illustrates this with an example face image. Subsequently, a PCA is performed on the gray values g of the shape-free images, forming a statistical model of texture:
g = ḡ + Pg bg ,    (2.16)
where ḡ represents the mean texture, and the matrix Pg linearly describes the texture variations, parameterized by the vector bg.
Since shape and texture are correlated, another PCA is applied on the concatenated vectors of bs and bg leading to the combined model:
x = x̄ + Qs c ,    g = ḡ + Qg c ,
where c is a parameter vector controlling the overall appearance, i.e. both shape and texture, and Qs and Qg represent the combined linear shape-texture subspace. Given a parameter vector c, the respective face can be synthesized by first building the shape-free image, i.e. the texture, using equation 2.16 and then warping the face image by applying equation 2.15 and the triangulation algorithm used to build the shape-free patches.
2.3.2 Matching the model
Having built the statistical shape and texture models, the objective is to match the model to an image by synthesizing the approximate appearance of the object in the real image. Thus, we want to minimize:
Δ = ||δI||² = ||Ii − Im||² ,
where Ii is the vector of gray values of the real image and Im that of the synthesized image.
The approach assumes that the object is roughly localized in the input image, i.e. at the start of the matching process, the model with its landmark points must not be too far away from the resulting locations.
Now, the decisive question is how to change the model parameters c in order to minimize Δ. A good approximation appears to be a linear model:
δc = A δg ,
where A is determined by a multivariate linear regression on the training data, augmented by examples with manually added perturbations. To calculate Ii − Im, the respective real and synthesized images are transformed to be shape-free using a preliminary estimate of the shape model. Thus, we compute:
δg = gi − gm ,
This linear approximation has been shown to perform well over a limited range of the model parameters, i.e. about 0.5 standard deviations of the training data. Finally, this estimation is put into an iterative framework, i.e. at each iteration we calculate:
c ← c − A δg ,
until convergence, where the matrix A is scaled such that it minimizes δg.
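The iterative matching scheme can be illustrated with a deliberately simplified numerical sketch, where a hypothetical linear map G stands in for the image synthesis and the computation of δg:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical linear stand-in for the synthesis model: the texture
# residual delta_g depends on the parameter displacement via an unknown
# matrix G (in a real AAM, delta_g comes from comparing images).
n_params, n_pixels = 4, 50
G = rng.normal(size=(n_pixels, n_params))
c_true = rng.normal(size=n_params)                   # "correct" parameters

def residual(c):
    """delta_g = I_m - I_i for the toy model."""
    return G @ (c - c_true)

# Training: apply known perturbations delta_c, record delta_g, and learn
# the matrix A of the linear model delta_c = A delta_g by regression.
dC = rng.normal(scale=0.5, size=(200, n_params))
dG = np.array([residual(c_true + dc) for dc in dC])
M, *_ = np.linalg.lstsq(dG, dC, rcond=None)
A = M.T                                              # delta_c = A delta_g

# Matching: start from a rough estimate and iterate c <- c - A delta_g.
c = c_true + rng.normal(scale=0.3, size=n_params)
for _ in range(20):
    c = c - A @ residual(c)
```

On this toy linear model the iteration recovers the true parameters; a real AAM is only approximately linear, hence the limited validity range mentioned above.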
The final result can then be used, for example, to localize specific feature points, to estimate the 3D orientation of the object, to generate a compressed representation of the image or, in the context of face analysis, to identify the respective person, gender or facial expression.
Clearly, AAMs can cope with small local image transformations and elegantly model shape and texture of an object based on a preceding statistical analysis of the training examples. However, the resulting projection space can be rather large, and the search in this space, i.e. the matching process, can be slow. A fundamentally different approach to taking into account local transformations of a signal are Hidden Markov Models (HMM). This is a probabilistic method that represents a signal, e.g. an image, as a sequence of observations. The following section outlines this approach.
2.4 Hidden Markov Models
2.4.1 Introduction
Hidden Markov Models (HMM), introduced by Rabiner et al. [190, 191], are commonly used to model the sequential aspect of data. In the signal processing context, for example, they have frequently been applied to speech recognition problems, modeling the temporal sequence of states and observations, e.g. phonemes. An image can also be seen as a sequence of observations, e.g. image subregions; here, the image either has to be linearized into a one-dimensional structure or special types of HMMs have to be used, for example two-dimensional Pseudo-HMMs or Markov Random Fields.
As they are the most common approaches in image analysis, we will focus on 1D and Pseudo-2D HMMs in the following. The major disadvantage of “real” 2D HMMs is their relatively high complexity in terms of computation time.
A HMM is characterized by a finite number of states, and it can be in only one state at a time (as a finite state machine). The initial state probabilities define, for every state, the probability of the HMM being in that state at time t = 1. For each following time step t = 2..T it can either change the state or stay in the same state with a certain probability defined by the so-called transition probabilities. Further, in any state it produces an output from a predefined vocabulary with a certain probability, determined by the output probabilities. At time t = T the HMM will have produced a certain sequence of outputs, called observations: O = {o1, . . . , oT }. The sequence of states Q = {q1, . . . , qT } it has traversed, however, is unknown (hidden) and has to be estimated by analyzing the observations; note that a single observation sequence could have been produced by different state sequences.

illustration not visible in this excerpt

Figure 2.2: A left-right Hidden Markov Model
Fig. 2.2 illustrates a simple example of a HMM with 3 states. This type of HMM is called left-right model or Bakis model.
More formally we can describe a HMM as follows:
Definition 1 A Hidden Markov Model is defined as λ = {S, V, A, B, Π}, where
- S = {s1, . . . , sN} is the set of N possible states,
- V = {v1, . . . , vL} is the set of L possible outputs constituting the vocabulary,
- A = {aij}, i, j = 1..N, is the set of transition probabilities from state i to state j,
- B = {bi(l)}, i = 1..N, l = 1..L, defines the output probabilities of output l in state i,
- Π = {π1, . . . , πN} is the set of initial state probabilities.
Note that
∑j=1..N aij = 1 ∀i ,   ∑l=1..L bi(l) = 1 ∀i ,   ∑i=1..N πi = 1 .
Given a HMM λ, the goal is to determine the probability of a new observation sequence O = {o1, . . . , oT }, i.e. P[O|λ]. For this purpose, there are several algorithms, the most simple one being explained in the following section.
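For illustration, P[O|λ] can be computed with the classical forward algorithm (not detailed in this excerpt); a minimal numpy sketch with hypothetical toy parameters for a left-right model:

```python
import numpy as np

# A small left-right HMM: N = 3 states, L = 2 output symbols.
Pi = np.array([1.0, 0.0, 0.0])               # initial state probabilities
A = np.array([[0.6, 0.4, 0.0],               # transition probabilities a_ij
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],                    # output probabilities b_i(l)
              [0.2, 0.8],
              [0.5, 0.5]])

def forward(O):
    """Forward algorithm: returns P[O|lambda] for an observation sequence."""
    alpha = Pi * B[:, O[0]]                  # alpha_1(i) = pi_i b_i(o_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]        # sum over predecessor states
    return alpha.sum()

p = forward([0, 1, 1, 0])
```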
2.4.2 Finding the most likely state sequence
There are many algorithms for estimating P[O|λ] and the most likely state sequence Q* = {q*1, . . . , q*T} having generated O. The most well known of these are the Viterbi algorithm and the Baum-Welch algorithm. Algorithm 1 describes the former, which is a kind of simplification of the latter. Note that δti denotes the probability of being in state si at time t, and φti denotes the most probable preceding state when being in si at time t. Thus, the φti store the most probable state sequence. The last loop allows retrieving the final most likely state sequence Q* by recursively traversing the φti.

Algorithm 1 The Viterbi algorithm

illustration not visible in this excerpt
When applying a HMM to a given observation sequence O, it suffices for most applications to calculate P[O|λ] as stated above. The actual state sequence Q* however is necessary for the training process explained in the following section.
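A minimal numpy sketch of the Viterbi recursion and the backtracking over the φti (a toy two-state model with illustrative parameters):

```python
import numpy as np

# Toy HMM with the notation of Definition 1: N states, L output symbols.
Pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def viterbi(O):
    """Return the most likely state sequence Q* for observations O."""
    T, N = len(O), len(Pi)
    delta = np.zeros((T, N))                 # delta[t, i]: best prob. in state i at t
    phi = np.zeros((T, N), dtype=int)        # phi[t, i]: best predecessor state
    delta[0] = Pi * B[:, O[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A    # trans[i, j]: coming from i to j
        phi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, O[t]]
    # Backtracking: recursively traverse phi to recover Q*.
    q = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        q.append(int(phi[t, q[-1]]))
    return q[::-1]
```

On the observation sequence 0, 0, 1, 1, 1 this model stays in the first state for the zeros and switches to the second state for the ones.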
2.4.3 Training
In order to automatically determine and readjust the parameters of λ, a set of training observations Otr = {o1tr, . . . , oMtr} is used, and a training algorithm, for example algorithm 1, is applied to estimate the probabilities P[Otr|λ] and P[Otr, qt = si|λ] for every state si at every time step t.
Then each parameter can be reestimated by regenerating the observation sequences Otr and “counting” the number of events determining the respective parameter. For example, to adjust aij one calculates:
âij = ( ∑t P[qt = si, qt+1 = sj | Otr, λ] ) / ( ∑t P[qt = si | Otr, λ] ) ,

i.e. the expected number of transitions from si to sj divided by the expected number of transitions out of si.
The output probabilities B and the initial state probabilities Π are estimated in an analogous way. However, the number and topology of the states S has to be determined experimentally in most cases.
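This counting-based reestimation of the aij can be sketched with forward and backward probabilities (a single Baum-Welch-style update on one toy observation sequence; parameters are illustrative):

```python
import numpy as np

Pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
O = [0, 0, 1, 1, 0, 1]
T, N = len(O), len(Pi)

# Forward probabilities alpha and backward probabilities beta.
alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = Pi * B[:, O[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
pO = alpha[-1].sum()                          # P[O|lambda]

# Expected transition counts (xi) and state occupancies (gamma).
xi = np.zeros((N, N))
for t in range(T - 1):
    xi += (alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]) / pO
gamma = (alpha * beta / pO)[:-1].sum(axis=0)  # occupancy for t = 1..T-1

A_new = xi / gamma[:, None]                   # reestimated a_ij
```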
2.4.4 HMMs for Image Analysis
HMMs are one-dimensional models and have initially been applied to the processing of audio data [190]. However, there are several approaches to adapt this technique to 2D data like images.
illustration not visible in this excerpt
Figure 2.3: Two simple approaches to image analysis with 1D HMMs
illustration not visible in this excerpt
Figure 2.4: Illustration of a 2D PseudoHMM
One of them[214] is to consider an image as a sequence of horizontal bands, possibly overlapping and spreading from top to bottom. Fig. 2.3(a) illustrates this. The HMM consequently has a leftright topology. Visual features of the image bands, e.g. pixel intensities, then correspond to the outputs of the HMM.
A similar approach is to partition the image into a set of blocks of predefined size. A onedimensional sequence is then formed by concatenating the lines (or columns) of blocks. Fig. 2.3(b) illustrates this procedure. Additional constraints can be added in order to ensure that certain states correspond to the end of lines in the image.
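The block-based linearization can be sketched as follows (a toy 8 × 8 image cut into 4 × 4 blocks, scanned line by line):

```python
import numpy as np

# Toy 8 x 8 "image" partitioned into non-overlapping 4 x 4 blocks; the
# blocks are concatenated line by line into a 1D observation sequence.
image = np.arange(64, dtype=float).reshape(8, 8)
bh, bw = 4, 4
blocks = [image[r:r + bh, c:c + bw].reshape(-1)   # one feature vector per
          for r in range(0, 8, bh)                # block, scanned left to
          for c in range(0, 8, bw)]               # right, top to bottom
O = np.array(blocks)                              # 4 observations of dim 16
```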
Finally, an approach called 2D Pseudo-HMM uses a hierarchical concept of superstates, illustrated in Fig. 2.4. The superstates form a vertical 1D sequence corresponding to the lines (or bands) of the image. Each superstate in turn contains a 1D HMM modeling the sequence of horizontal observations (pixels or blocks) in a line. Thus, determining the hidden state sequence Q of an observation O implies a two-level procedure: first, calculating the most likely sequence of superstates using the lines or bands of the image and, secondly, determining the most likely sequence of substates corresponding to each line independently.
Obviously, HMMs are very suitable for modeling sequential data, and thus they are principally used in signal processing tasks. Let us now consider some more general machine learning techniques which do not explicitly model this sequential aspect but which, on the other hand, can be applied more easily and efficiently to higher-dimensional data such as images. Adaptive Boosting is one such approach and will be explained in the following section.
2.5 Adaboost
2.5.1 Introduction
Adaptive Boosting, short Adaboost, is a classification technique introduced by Freund and Schapire[70]. The basic idea here is to combine several “weak” classifiers into a single “strong” classifier, where the weak classifiers perform only slightly better than just random guessing.
The principle of the algorithm is to learn a global binary decision function by iteratively adding and training weak classifiers, e.g. wavelet networks or Neural Networks, while focusing on more and more difficult examples. It has been applied to many classification problems and has become a widely used machine learning technique due to its simplicity and performance in terms of classification rate and computation time.
2.5.2 Training
Let {(x1, y1), . . . , (xm, ym)} be the training set, where the xi ∈ X are the training examples and yi ∈ Y the respective class labels. We will focus here on the basic Adaboost algorithm where Y = {−1, +1}, but extensions to multi-class classification have been proposed in the literature [71, 216].
The procedure is as follows: at each iteration t = 1..T a weak classifier ht : X → {−1, +1} is trained using the training examples weighted by a set of weights Dt(i), i = 1..m. Then, the weights corresponding to misclassified examples are increased and weights corresponding to correctly classified examples are decreased. Thus, the algorithm focuses more and more on harder examples. The final decision H(x) calculated by the strong classifier is then a weighted sum of the weak decisions ht(x), where the weights αt are chosen to be inversely proportional to the error εt of the classifier ht, i.e. if the error is large, the respective classifier will have less influence on the final decision. Algorithm 2 describes the basic Adaboost algorithm. The variable Zt is a normalization constant in order to make Dt+1 a distribution.
illustration not visible in this excerpt
Thus, if γt > 0 for all t, where γt = 1/2 − εt, i.e. each weak hypothesis performs at least slightly better than random guessing, the training error drops exponentially fast.
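The bound in question is not visible in this excerpt; in the standard formulation of Freund and Schapire (a reconstruction, not copied from the thesis, with εt the weighted error of ht and γt = 1/2 − εt) it reads:

```latex
\frac{1}{m}\,\bigl|\{\, i : H(x_i) \ne y_i \,\}\bigr|
  \;\le\; \prod_{t=1}^{T} Z_t
  \;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
  \;\le\; \exp\Bigl(-2\sum_{t=1}^{T} \gamma_t^2\Bigr)
```

The right-hand side makes the exponential decrease explicit: each round with a positive edge γt multiplies the bound by a factor strictly smaller than one.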
Schapire et al. [215] also conducted theoretical studies in terms of the generalization error. To this end, they define the margin of the training examples
Algorithm 2 The Adaboost algorithm
1: D1(i) = 1/m ∀i = 1..m
illustration not visible in this excerpt
The test error rate, or structural risk R(α), is upper bounded by the training error rate, or empirical risk Remp, plus an additional term called the VC confidence, which depends on the so-called Vapnik-Chervonenkis (VC) dimension h of the classification function. More precisely, with probability 1 − η, the following holds [246]:
illustration not visible in this excerpt
where α are the parameters of the function to learn and l is the number of training examples. The VC-dimension h of a class of functions describes its "capacity" to classify a set of training data points. For a two-class classification problem, a function f has VC-dimension h if there exists at least one set of h data points that f can classify correctly under every possible assignment of the labels −1 and +1 to them, i.e. a set that f "shatters". If the VC-dimension is too high, the learning machine will overfit and show poor generalization. If it is too low, the function will not sufficiently approximate the distribution of the data and the empirical error will be too high. Thus, the goal of SRM is to choose a function class whose VC-dimension h minimizes the structural risk R(α), which is supposed to lead to maximum generalization.
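The bound referred to above is missing from this excerpt; in Vapnik's usual formulation (a reconstruction, not copied from the thesis) it states that, with probability 1 − η,

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}
```

where the square-root term is the VC confidence: it grows with the VC-dimension h and shrinks with the number of training examples l.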
2.6.2 Linear Support Vector Machines
Vapnik[246] showed that for linear hyperplane decision functions:
illustration not visible in this excerpt
the VC-dimension is determined by the norm of the weight vector w.
Let {(x1, y1), . . . , (xl, yl)} (xi ∈ Rn, yi ∈ {−1,+1}) be the training set. Then, for a linearly separable training set we have:
illustration not visible in this excerpt
The margin between the positive and negative points is delimited by the two hyperplanes x · w + b = ±1, on which the above term is exactly zero. Fig. 2.5 illustrates this. Further, no points lie between these hyperplanes, and the width of the margin is 2/‖w‖. The support vector algorithm maximizes the margin by minimizing ‖w‖, which, following the SRM principle, is supposed to yield an optimal solution, i.e. maximal generalization. Once the maximum margin is obtained, the data points lying on one of the separating hyperplanes, i.e. for which equation 2.31 yields zero, are called support vectors (illustrated by double circles in Fig. 2.5).
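As a small numerical check of the margin geometry (toy values, unrelated to the thesis experiments): a point on the hyperplane x · w + b = +1 lies at distance 2/‖w‖ from the hyperplane x · w + b = −1.

```python
import math

# toy hyperplane parameters in 2-D (illustrative values only)
w = [3.0, 4.0]                      # weight vector, so ||w|| = 5
b = -2.0                            # bias

norm_w = math.hypot(*w)             # Euclidean norm of w

# a point lying exactly on the hyperplane w.x + b = +1
scale = (1.0 - b) / (norm_w ** 2)
p_plus = [scale * w[0], scale * w[1]]

# its distance to the opposite hyperplane w.x + b = -1:
# distance of a point p to w.x + b = c is |w.p + b - c| / ||w||
dist = abs(w[0] * p_plus[0] + w[1] * p_plus[1] + b - (-1.0)) / norm_w

margin = 2.0 / norm_w               # the claimed margin width 2/||w||
```

With ‖w‖ = 5 both quantities evaluate to 0.4, confirming the 2/‖w‖ formula on this example.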
To simplify the calculation, the problem is formulated in a Lagrangian framework (see [246] for details). This leads to the maximization of the Lagrangian:
illustration not visible in this excerpt
Figure 2.5: Graphical illustration of a linear SVM
illustration not visible in this excerpt
where αi (i = 1..l) are the Lagrange multipliers to be determined. Further, the solutions for αi together with condition 2.31 imply a value for b. Note that all αi are zero except those corresponding to the support vectors.
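Since the dual problem itself is not visible in this excerpt, the standard form from the SVM literature (a reconstruction consistent with the surrounding text, not copied from the thesis) is:

```latex
\max_{\alpha}\; L_D \;=\; \sum_{i=1}^{l} \alpha_i
   \;-\; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
         \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{s.t.} \quad \alpha_i \ge 0, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0,
\quad \text{with} \quad w \;=\; \sum_{i=1}^{l} \alpha_i y_i\, x_i .
```

The expansion of w over the training points explains why only the examples with αi > 0, the support vectors, contribute to the decision function.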
Finally, new examples can simply be classified using the decision function 2.30.
In many cases, however, the training data cannot be completely separated because of some "outliers". Then, we may simply loosen constraint 2.31 by introducing slack variables ξi ≥ 0 in the following way:
illustration not visible in this excerpt
and condition 2.35 becomes
illustration not visible in this excerpt
2.6.3 Nonlinear Support Vector Machines
In order to use a nonlinear decision function, the above formulas can quite easily be generalized. Boser et al. [23] proposed a simple method based on the so-called kernel trick. That is, before applying the dot product xi · xj in equation 2.32, the d-dimensional data is projected into a higher-dimensional space where it is supposed to be linearly separable. Thus, a function Φ : Rd → H is defined and xi · xj becomes Φ(xi) · Φ(xj). Now, instead of calculating Φ each time, we use a kernel function K(xi, xj) = Φ(xi) · Φ(xj), i.e. each occurrence of the dot product is replaced by K(·, ·). Thus, if we want to classify a new data point s the decision function
illustration not visible in this excerpt
illustration not visible in this excerpt
With the kernel function K we need not compute Φ or H explicitly, but we must know whether for a given K there exist a mapping Φ and some space H in which K is the dot product, K(xi, xj) = Φ(xi) · Φ(xj). This property is ensured by Mercer's condition [246]:
Theorem 1 There exists a mapping Φ and an expansion ∑
illustration not visible in this excerpt
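The kernel trick can be checked on a textbook example (not specific to this thesis): for the degree-2 polynomial kernel K(x, y) = (x · y)² on R², the explicit mapping Φ(x) = (x1², √2 x1x2, x2²) reproduces the kernel as a dot product in the feature space H.

```python
import math

def K(x, y):
    """Homogeneous polynomial kernel of degree 2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map with K(x, y) = phi(x) . phi(y)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [1.0, 2.0], [3.0, -1.0]
lhs = K(x, y)                 # kernel evaluated in the input space
rhs = dot(phi(x), phi(y))     # dot product in the feature space H
```

Both sides agree (up to floating-point error), so an SVM can work in the 3-dimensional space H while only ever evaluating K on 2-dimensional inputs.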
2.6.4 Extension to multiple classes
Up to this point, we have only considered two-class problems. However, there are simple ways to extend the SVM method to several classes. One approach, called one-against-all, consists in training one classifier for each class that distinguishes between the examples of that class and the examples of all other classes; a new example is then assigned the class whose SVM produces the highest output. Thus, the number of SVMs equals the number of classes n.
Another approach, called one-against-one, trains an SVM for each possible pair of classes. To classify an example, it is input to each SVM, and the class obtaining the maximal number of "winning" SVMs represents the final answer. The number of classifiers needed by this approach is n(n−1)/2, which is a drawback in terms of complexity compared to the first approach.
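A minimal sketch of the pairwise voting scheme; the "SVMs" here are stand-in nearest-center rules on toy 1-D data, and all names and values are illustrative, not from the thesis.

```python
from itertools import combinations

# toy stand-ins for trained pairwise SVMs: three classes clustered
# around the 1-D centers below; each pairwise decision is a midpoint test
centers = {0: 0.0, 1: 5.0, 2: 10.0}

def pairwise_svm(a, b, x):
    """Stand-in for the SVM separating classes a and b: returns the winner."""
    return a if abs(x - centers[a]) < abs(x - centers[b]) else b

def classify(x):
    """One-against-one voting over all n(n-1)/2 pairwise classifiers."""
    votes = {c: 0 for c in centers}
    for a, b in combinations(sorted(centers), 2):
        votes[pairwise_svm(a, b, x)] += 1
    return max(votes, key=votes.get)    # class with most "winning" SVMs

n = len(centers)
num_pairs = n * (n - 1) // 2            # 3 classifiers for n = 3
```

For n = 3 classes only three pairwise classifiers are needed, but the quadratic growth in n is the complexity drawback mentioned above.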
2.7 Bag of Local Signatures
As opposed to SVMs, which are a very general classification technique, an approach called Bag-of-Local-Signatures (BOLS) has recently been introduced by Csurka et al. [51] for image classification problems, particularly object detection and recognition. It was motivated by the bag-of-words approach for text categorization, which simply counts the number of predefined key words in a document in order to classify it into one of several categories.

illustration not visible in this excerpt

Figure 2.6: The histogram creation procedure with the Bag-of-Local-Signatures approach: a) input image I, b) detected salient points, c) extracted local signatures, d) quantized vectors (dictionary entries), e) histogram h(I).
In the first step of the BOLS method, n salient points pi = (xi, yi) of the input image are detected using an interest point detection algorithm, e.g. the Harris affine detector [162]. The small image region around each detected point is then represented by some local descriptor, such as the Scale-Invariant Feature Transform (SIFT) descriptor [148], leading to a local signature si for each salient point.
In the following step, the extracted signatures are classified by applying some kind of vector quantization. To this end, a dictionary of k representative signatures dj (j = 1..k) is calculated from the training set using a clustering algorithm. For example, Csurka et al. [51] used the k-means clustering algorithm and Ros et al. [199] used a Self-Organizing Map (SOM).
Thus, for an image I to classify, a bag of local signatures vi, i.e. entries of the dictionary, is obtained, representing the appearance of the object in the image. However, for two different images of the same object the respective representations might differ, due to the varying appearance in different views or to partial occlusions, making an efficient comparison difficult.
Therefore, discrete histograms h(I) of the bags of local signatures vi are calculated by simply counting the number of occurrences of the respective signatures. Finally, the histograms can be classified using classical histogram distance measures, such as the χ² distance or the Earth Mover's Distance (EMD), or by training a classifier, such as a Bayes classifier or an SVM, on the vectors formed by the histogram values [51].
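The quantization and histogram steps can be sketched as follows; the 2-D "signatures", the three-entry dictionary and the function names are illustrative stand-ins for real SIFT descriptors and a k-means dictionary.

```python
def nearest(sig, dictionary):
    """Index of the dictionary entry closest to the signature (Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(sig, d)) for d in dictionary]
    return dists.index(min(dists))

def bols_histogram(signatures, dictionary):
    """Histogram h(I): occurrence counts of the quantized local signatures."""
    h = [0] * len(dictionary)
    for s in signatures:
        h[nearest(s, dictionary)] += 1
    return h

def chi2(h1, h2):
    """Chi-squared distance between two histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

# toy 2-D "signatures" and a 3-entry dictionary (illustrative values)
dictionary = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sigs_image1 = [[0.1, 0.1], [0.9, 0.1], [1.1, -0.1], [0.0, 0.9]]
sigs_image2 = [[0.2, 0.0], [0.1, 1.1], [0.0, 0.8], [-0.1, 1.0]]

h1 = bols_histogram(sigs_image1, dictionary)
h2 = bols_histogram(sigs_image2, dictionary)
d = chi2(h1, h2)
```

The histograms are fixed-length vectors regardless of how many salient points each image produced, which is what makes images directly comparable.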
Figure 2.6 illustrates the overall procedure for generating the Bag-of-Local-Signatures representation. A major advantage of this approach compared to statistical projection methods, for example, is its robustness to partial occlusions and to changing pose of the object to recognize. This is due to the purely local representation and the rotation- and scale-invariant description of the local image patches.
As this technique is a relatively new approach in the field of machine learning and very specific to image classification, we will not describe it in more detail here. We will rather concentrate on a very versatile and powerful machine learning technique constituting the basis for all of the face analysis approaches proposed in this work, namely Artificial Neural Networks.
illustration not visible in this excerpt
Figure 2.7: The Perceptron
2.8 Neural Networks
2.8.1 Introduction
Artificial Neural Networks (ANN), or short Neural Networks (NN), denote a machine learning technique that has been inspired by the human brain and its capacity to perform complex tasks by means of interconnected neurons, each performing a very simple operation. Likewise, a NN is a trainable structure consisting of a set of interconnected units, each implementing a very simple function, which together eventually perform a complex classification or approximation task.
2.8.2 Perceptron
The most well-known type of neural unit is the Perceptron, introduced by Rosenblatt [200]. Its basic structure is illustrated in Fig. 2.7. It has n inputs and one output, where the output is a simple function of the sum of the input signals x weighted by w plus an additional bias b. Thus,
illustration not visible in this excerpt
Often, the bias is put inside the weight vector w such that w0 = b and the input vector x is extended correspondingly to have x0 = 1. Equation 2.46 then becomes:
illustration not visible in this excerpt
where φ is the Heaviside step function: φ : R → R
illustration not visible in this excerpt
The Perceptron thus implements a very simple twoclass classifier where w is the separating hyperplane such that w · x ≥ 0 for examples from one class and w · x < 0 for examples from the other.
In 1962, Rosenblatt presented the perceptron convergence theorem [201], which guarantees that the supervised perceptron training algorithm finds a separating hyperplane for any linearly separable two-class classification problem in a finite number of steps. However, Minsky and Papert [163] pointed out that there are very simple classification problems where the perceptron fails, namely when the two classes are not linearly separable, as in the XOR problem, where the
illustration not visible in this excerpt
Figure 2.8: A MultiLayer Perceptron
patterns (0, 0) and (1, 1) belong to one class and (0, 1) and (1, 0) to the other. This motivated the use of several interconnected perceptrons, which are able to form more complex decision boundaries by combining several hyperplanes. The most common type of these NNs is the Multi-Layer Perceptron, described in the following section.
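The perceptron learning rule and its failure on XOR can be sketched as follows; this is toy code with the bias folded into w as described above, and the epoch count and learning rate are illustrative choices.

```python
def predict(w, x):
    """Heaviside threshold on w.x, with x[0] = 1 carrying the bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def train_perceptron(samples, epochs=20, lr=1.0):
    """Rosenblatt's learning rule: update w only on misclassified examples."""
    w = [0.0, 0.0, 0.0]                 # bias weight plus two input weights
    for _ in range(epochs):
        for x, y in samples:
            if predict(w, x) != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def errors(w, samples):
    return sum(predict(w, x) != y for x, y in samples)

# inputs extended with x0 = 1 for the bias, labels in {-1, +1}
AND = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
XOR = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], -1)]

w_and = train_perceptron(AND)
w_xor = train_perceptron(XOR)
and_errors = errors(w_and, AND)         # linearly separable: converges
xor_errors = errors(w_xor, XOR)         # not separable: errors remain
```

AND is linearly separable, so the convergence theorem applies and the rule finds a perfect hyperplane; for XOR no such hyperplane exists, so some training error always remains.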
2.8.3 MultiLayer Perceptron
Multi-Layer Perceptrons (MLPs) are capable of approximating arbitrarily complex decision functions. With the advent of a practicable training algorithm in the 1980s, the so-called Backpropagation algorithm [208], they became the most widely used form of NNs.
Fig. 2.8 illustrates the structure of an MLP. There is an input layer, one or more hidden layers and an output layer of neurons, where each neuron except the input neurons implements a perceptron as described in the previous section. Moreover, the neurons of one layer are connected only to the neurons of the following layer. This type of network is called a feed-forward network, i.e. the activation of the neurons is propagated layer-wise from the input to the output layer. If there is a connection from each neuron to every neuron in the following layer, as in Fig. 2.8, the network is called fully connected. Further, the neurons' activation function has to be differentiable in order for the weights to be adjustable by the Backpropagation algorithm. Commonly used activation functions are, for example:
illustration not visible in this excerpt
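To illustrate how an MLP overcomes the perceptron's limitation, the following sketch runs a forward pass through a small fully connected feed-forward network with sigmoid activations. The weights are hand-picked for illustration (not trained by Backpropagation): one hidden unit approximates OR, the other NAND, and the output unit combines them into XOR.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, inputs):
    """One fully connected layer: each neuron computes sigmoid(w.x + b)."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def mlp(x):
    # hidden layer: unit 1 ~ OR(x1, x2), unit 2 ~ NAND(x1, x2)
    hidden = layer([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0], x)
    # output layer: ~ AND of the two hidden units, i.e. XOR of the inputs
    out = layer([[20.0, 20.0]], [-30.0], hidden)
    return out[0]

xor_outputs = {(a, b): round(mlp([a, b])) for a in (0, 1) for b in (0, 1)}
```

By combining two hyperplanes in the hidden layer, the network forms the non-linearly-separable XOR decision boundary that a single perceptron cannot represent.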
for any θ > 0, where Pr[·] denotes the empirical probability on the training set and d the VC-dimension of the weak classifiers.
Adaboost is a very powerful machine learning technique, as it can turn any weak classifier into a strong one by linearly combining several instances of it. A completely different classification approach, the Support Vector Machine (SVM), is based on the principle of Structural Risk Minimization, which not only tries to minimize the classification error on the training examples but also takes into account the ability of the classifier to generalize to new data. The following section explains this approach in more detail.
2.6 Support Vector Machines
2.6.1 Structural Risk Minimization
The classification technique called Support Vector Machine (SVM) [23, 246, 44] is based on the principle of Structural Risk Minimization (SRM) formulated by Vapnik et al. [245]. One of the basic ideas of this theory is that the test error rate of a classifier can be bounded in terms of its empirical error on the training set and its capacity.
illustration not visible in this excerpt
[...]
Quote paper: Dr. Stefan Duffner (Author), 2008, Face Image Analysis with Convolutional Neural Networks, Munich, GRIN Verlag, https://www.grin.com/document/133318