Classifying Facial Emotions via Machine Learning

Sumit Kumar Singh
7 min read · May 6, 2019


Index

  • What is an Emotion?
  • Work-plan to solve the Problem
  • Step 1: Pre-processing Images
  • Step 2: Extract Features from faces
  • Step 3: Clustering the features
  • Step 4: Training the dataset and Conclusion

What is an Emotion?

An emotion is a subjective and private mental and physiological state. It involves behaviors, actions, thoughts, and feelings. Facial expressions can be considered not only the most natural form of displaying human emotions but also a key nonverbal communication technique.

In 1969, after recognizing a universality of emotions across different groups of people despite their cultural differences, Paul Ekman and Wallace Friesen classified a set of basic emotional expressions as universal: happiness, sadness, disgust, surprise, contempt, anger, and fear.

Later they developed the Facial Action Coding System (FACS) to systematically describe every combination of emotions through facial micro-expressions. This problem could be tackled by hard-coding the facial action units programmatically, but we are going to follow a machine learning approach instead.

[Go to Index]

Work-plan to solve the Problem

  • As a pre-processing step, the images are cropped around the faces and intensity-normalized.
  • Compute local descriptors and extract features from the processed images.
  • Cluster the features using the K-means algorithm to create a Bag of Words (BoW) vector representation of each image, or
  • compute a Vector of Locally Aggregated Descriptors (VLAD) representation instead.
  • Finally, train an SVM classifier on the labelled dataset over the emotion classes.

Let’s dive deep into each step.

[Go to Index]

Step 1: Pre-processing Images

Extract faces from the image using a Haar cascade, resize, and apply denoising filters.
  • Identify the face in the image using a Haar cascade classifier, crop it to a 256 × 256 resolution, and apply a denoising filter (see the sketch below).
Note: Haar cascades are pre-trained classifiers for detecting objects of a particular type, e.g. faces (frontal, profile), pedestrians, etc.
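
A minimal sketch of this step with OpenCV. The particular cascade file, the denoising filter, and the histogram equalization used as intensity normalization are implementation choices of mine, not fixed by the article:

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_face(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None                              # no face detected
    x, y, w, h = faces[0]                        # first detected face
    face = gray[y:y + h, x:x + w]                # crop around the face
    face = cv2.resize(face, (256, 256))          # fixed 256 x 256 resolution
    face = cv2.fastNlMeansDenoising(face)        # denoising filter (assumption)
    face = cv2.equalizeHist(face)                # intensity normalization (assumption)
    return face
```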

[Go to Index][Go to Workplan]

Step 2: Extract Features from faces

Now that we have the faces separated from the rest of the image content, it's time to extract some important features of the face.

Brief Introduction to SIFT:

BoW featurization relies on computing local descriptors of the images beforehand. We'll use the Scale-Invariant Feature Transform (SIFT) descriptor to capture features from the face. Its computation relies on the Gaussian scale space of the image and the corresponding Difference of Gaussians (DoG), which detects keypoints in the image.

The SIFT descriptor is a 128-dimensional vector that describes the local image intensity by computing gradients in patches around these keypoints.

Working of SIFT

Very interesting properties of this descriptor with respect to our problem are its invariance to affine transformations and its robustness to changes in illumination, noise, and small changes in viewpoint.

Image content is transformed into local feature coordinates that are invariant to translation, rotation, scale, and other imaging parameters.

Thus they may capture the facial expression even if the subject's head appears at an angle.

Drawing SIFT keypoints on the face
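
Here is a short sketch of how these descriptors can be computed with OpenCV (assuming opencv-python 4.4+, where SIFT is part of the main module):

```python
import cv2

def extract_sift(face):
    """Return SIFT keypoints and an (n_keypoints, 128) descriptor array."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(face, None)
    return keypoints, descriptors

# To reproduce the figure above, the keypoints can be drawn on the face:
# vis = cv2.drawKeypoints(face, keypoints, None,
#                         flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
```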

[Go to Index][Go to Workplan]

Step 3: Clustering the features

Encoding:

Encoding quantizes the image patches that constitute the image to be classified. It works by first running K-means on the set of all patch descriptors that we collect using SIFT.

Visual representation of finding patches in an image

This builds what is known as a dictionary/codebook, represented by the centroids obtained from the clustering. At the end of this process, you end up with K representative "visual words".

These “visual words” represent what is usually understood as your visual vocabulary.

Once you have these visual words, encoding is the process of assigning each of the patches within your image to its nearest neighbor in the dictionary.
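
A sketch of the dictionary-building and encoding steps with scikit-learn; `descriptor_list` (one SIFT descriptor array per training image) and the value of K are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack the SIFT descriptors of all training images into one array.
all_descriptors = np.vstack(descriptor_list)   # shape (n_patches, 128)

K = 100                                        # number of visual words
kmeans = KMeans(n_clusters=K, random_state=0).fit(all_descriptors)
codebook = kmeans.cluster_centers_             # the K "visual words"

def encode(descriptors):
    """Assign each patch descriptor to its nearest visual word (0..K-1)."""
    return kmeans.predict(descriptors)
```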

Pooling:

BoW Featurization: The BoW representation refers to representing an image as a "bag of visual words". It is popular for indexing and categorization applications, since these vector representations can be compared with standard distances and subsequently used by robust classification methods such as support vector machines.

The word "bag" here conveys that once we have encoded each patch with a word (a number between 1 and K), we build a new representation that discards the spatial relationships between the words that constitute the image.

Pooling is the process of building a feature vector that is a histogram of visual words (as opposed to putting the full "image" in the feature vector). The descriptor is therefore invariant to changes in the ordering of words/patches, which in the computer vision context translates into invariance with respect to rotations and distortions of the image.

Representing an image with a histogram

Each descriptor of dimension d from an image is strictly assigned to one cluster. The final BoW representation is obtained as the histogram counting the occurrences of each visual word in the image, so it produces a K-dimensional vector.
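
In code, pooling one image into its BoW histogram might look like this (the normalization at the end is a common choice of mine, not mandated by the article):

```python
import numpy as np

def bow_vector(descriptors, kmeans, K):
    """Pool an image's descriptors into a K-dimensional BoW histogram."""
    words = kmeans.predict(descriptors)              # encoding step
    hist, _ = np.histogram(words, bins=np.arange(K + 1))
    hist = hist.astype(float)
    return hist / (hist.sum() + 1e-8)                # normalized histogram
```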

[Go to Index][Go to Workplan][Go to Encoding]

VLAD Featurization: The Vector of Locally Aggregated Descriptors (VLAD) is a representation of an image that aggregates descriptors based on a locality criterion in feature space. It can be seen as an extension of the BoW representation. The dimension D of this representation is D = k × d, where
k = number of visual words (centroids) obtained with k-means
d = dimension of the local descriptor

Let’s understand VLAD visually

  • Learn a codebook C = {c₁, …, cₖ} of k centroids with k-means.
  • Each local descriptor x is associated with its nearest centroid, cᵢ = NN(x).
  • The idea is to accumulate, for each centroid cᵢ, the differences x − cᵢ of the vectors x assigned to cᵢ. This characterizes the distribution of the vectors with respect to the centroid.
  • A component of v is obtained as a sum over all the image descriptors: vᵢ = Σ_{x : NN(x) = cᵢ} (x − cᵢ).
  • The vector v is subsequently L2-normalized.

VLAD Computation steps summary

Dimension D = k × 128 (here d = 128, the dimension of a SIFT descriptor)

The primary advantage of VLAD over BoW is that the feature vector is more discriminative: instead of a simple count, each cell stores the accumulated differences of its descriptors from the cluster centroid.
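
A sketch of the VLAD computation following the steps above, as a straightforward (unoptimized) NumPy implementation:

```python
import numpy as np

def vlad_vector(descriptors, codebook):
    """Compute the D = k * d VLAD representation of one image."""
    k, d = codebook.shape
    # (1) associate each descriptor with its nearest centroid
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    # (2)-(3) accumulate the residuals x - c_i per centroid
    v = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[nearest == i]
        if len(assigned) > 0:
            v[i] = (assigned - codebook[i]).sum(axis=0)
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalization
```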

[Go to Index][Go to Workplan][Go to Encoding][Go to Pooling]

Step 4: Training the dataset and Conclusion

Dataset Used :

We have used a subset of the Extended Cohn-Kanade (CK+) dataset, with 1125 face images. Each image has a label from a set of 8 emotions:

0 = neutral, 1 = anger, 2 = contempt, 3 = disgust, 
4 = fear, 5 = happy, 6 = sadness and 7 = surprise
Images from Dataset. From left to right: Surprise (label 7), Happy (label 5) and Disgust (label 3)

Dataset Processing :

Each subject's folder contains expressions varying from neutral to high intensity. For each subject (person), we extracted 4 images of the corresponding labelled expression and 1 image with a neutral expression.

For each emotion label, we have 4 images with varying intensity of the facial expression. We replicated the label 4 times to match the length of the list of image filenames, and the neutral files and labels (0 = neutral) are appended at the end of the lists.
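
A sketch of how the filename and label lists can be assembled; the helpers `peak_images` and `neutral_image` and the `subject_emotion` mapping are hypothetical names used only for illustration:

```python
image_files, labels = [], []

for subject in subjects:
    emotion = subject_emotion[subject]            # labelled expression (1..7)
    image_files += peak_images(subject, n=4)      # 4 highest-intensity frames
    labels += [emotion] * 4                       # replicate the label 4 times

for subject in subjects:                          # neutral frames appended last
    image_files.append(neutral_image(subject))
    labels.append(0)                              # 0 = neutral
```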

Training :

The SVM classifier is trained on 5 emotion classes because of some missing data. 80% of the dataset is used as the training set and the remaining 20% as the test set. The accuracies calculated on the test set using both featurizations are as follows:

We used an SVM with the "one vs the rest" strategy for multi-class classification.
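
The training step can be sketched with scikit-learn as follows; here `X` holds the BoW or VLAD vectors, `y` the labels, and the linear kernel is an assumption:

```python
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = OneVsRestClassifier(SVC(kernel="linear"))   # "one vs the rest" SVM
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```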

Maximum accuracy for BoW (79.89 %) is achieved at K=100 and for VLAD (97.35 %) it’s achieved at K=40.

Comparison of Confusion Matrices (K=40, 5 classes)

There is a great improvement in accuracy from using the VLAD vector relative to the simple BoW vector.

It works quite well in real time too:

Anger
Happy

There are a few limitations to this model, and a lot of further improvement is possible. Comparing this approach with deep CNNs and identifying the limitations of this approach is left for the reader to explore.
