Recognizing emotions in facial expressions is relatively straightforward for humans, and in recent times machines are getting better at it too. The applications of emotion-detecting computers are numerous, from improving advertising to treating depression, the possibilities are limitless. Motivated mainly by the impact in mental health that such technology can have, I started building my own emotion recognition technology.
In a previous post I described a quick test in which I used ideas drawn from research on how facial expressions are decomposed. In this simplified scenario a computer distinguished between sad and happy faces by detecting facial landmarks (points of eyes, mouth, etc …) and using one simple geometric feature of the mouth (representing a Lip Corner Puller). That single-rule algorithm was correct 76% of the time. As usual I got quickly overexcited and started defining other geometric features to improve and extend to six basic emotions (anger, disgust, fear, joy, sadness, and surprise).
To detect a Cheek Raiser, which basically closes the eyelids, and it’s more obvious when we laugh, I used the ratio of the height to the width of the eyes. To detect an Inner Brow Raiser, which basically raises the inner brows, and is characteristic of emotions like sadness, fear, and surprise, I computed the slope of a line crossing the landmarks representing the inner and outer brows.
As you can guess by now, manually identifying geometric features, to represent the nearly 20 actions necessary to perform the 6 basic emotions, got crazy hard pretty quickly. Not to mention that many were just impossible to define just using the landmarks (a Brow Lowerer just wrinkles the forehead). Even if I could successfully define them all, determining how to effectively combine them to detect an emotion would be just impossible by hand.
So I went back to machine learning, which essentially let’s a machine learn how to efficiently combine features to classify or detect things. To make my life easier, instead of manually defining the geometric features, I decided to just feed the machine a series of lengths of the lines representing a face mesh (as described here). The idea is that such lengths will vary from emotion to emotion as a representation of muscle contractions and extensions.
Given such lengths, 178 to be more precise, a classifier can be trained to recognize different emotions. In my particular case I tried the popular Support Vector Machines (SVM) and a Logistic Regression (Logit) classifier, trained on around 20,000 low-res images (48×48 pixels). Logit gave better results across 3 completely different test sets. For the NimStim Face Stimulus Set (574 faces) it achieved 54% accuracy, for a subset of images crawled from flickr user The Face We Make (850 faces) it achieved 55%, and for a set collected from Google Image Search and manually labeled by me (734 faces) it achieved 49% accuracy. The performance is not exactly human-like, and there are certainly systems way more accurate, but it is worth remembering that it only uses 178 features, and was trained in less than a minute in a laptop (as opposed to hours in multiple GPUs for state-of-the-art systems).
Finally, some papers I have surveyed mention that state-of-the-art accuracy can be achieved by combining geometric features with texture features. Texture features can be used to detect wrinkles in forehead, nose, and other parts of the face resulting from certain facial expressions. In the near future I will learn how to extract and try such features.