Facial Emotion Recognition: Single-Rule 1–0 Deep Learning

In my attempt to build Artificial Emotional Intelligence, I first turned my head to Deep Learning. The main reason is its recent success at cracking Computer Vision tasks, and I am currently working on the part that detects emotions from faces (I already have the part that understands the content of our written words). So I spent the last month and a half taking online courses, reading online books, and learning a Deep Learning tool. To be honest, that was the easy part. The real challenge was amassing a decent dataset of faces classified by emotion. Why is that a challenge? Because Deep Learning algorithms are data-hungry!

In order to get a decent dataset, I collected face pics from Google Images and cropped the faces with OpenCV (as described here). I was able to collect several thousand pics, but my annotation approach failed because many pics either did not contain a face or did not show the right emotion. In the end, I was left with just around 600 pics, a useless number for hungry Neural Networks. Looking around the Internet, I was able to crawl a pre-labeled, yet still small, set from Flickr user The Face We Make (TFWM). In a desperate (and probably my smartest) move, I asked on the MachineLearning subreddit, and someone saved my life: I was pointed to a collection from a Kaggle competition with over 35K pics labeled with 6 emotions plus a neutral class. I was suddenly filled with hope.

Armed with a larger dataset and my beginner's skills in Deep Learning, I modified the two TensorFlow MNIST sample networks to train them with the 35K pics and test them with the TFWM set. I was thrilled and full of anticipation while the code was running; after all, the simplest code (simple MNIST), a modest softmax regression, achieves 91% accuracy on the MNIST dataset, while the deeper code (deep MNIST), a two-layer Convolutional Network, achieves around 99.2%. What a huge disappointment when the highest accuracies I got were 14.7% and 21.8% respectively. I then tried changing a few parameters, like the learning rate and number of iterations, and switching from Softmax to ReLU in the last layer, but things did not change much. Somehow I felt cheated, so before spending more time exploring Deep Learning in order to build more complex networks, I decided to try a small experiment.
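For reference, the simple-MNIST baseline amounts to a softmax regression. Below is a minimal, framework-agnostic numpy sketch of its forward pass, assuming 48×48 grayscale inputs (as in the Kaggle set) and 7 classes. This is not my actual TensorFlow code, just the idea:

```python
import numpy as np

# Softmax regression forward pass, analogous to the simple MNIST
# tutorial network, adapted to 48x48 grayscale faces (2304 inputs)
# and 7 output classes (6 emotions plus neutral).
N_PIXELS = 48 * 48
N_CLASSES = 7

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(N_PIXELS, N_CLASSES))  # weight matrix
b = np.zeros(N_CLASSES)                                 # bias vector

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(images):
    """images: (batch, 2304) array of pixel values scaled to [0, 1].
    Returns a (batch, 7) array of class probabilities."""
    return softmax(images @ W + b)

# A fake batch of two "faces": each output row is a probability
# distribution over the 7 emotion classes.
batch = rng.random((2, N_PIXELS))
probs = predict(batch)
```

Training (cross-entropy loss plus gradient descent on W and b) is what the TensorFlow tutorial adds on top of this forward pass.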

Having been working with emotions for a while, I have become familiar with research by psychologist Paul Ekman. In fact, most emotions detected by systems are somehow based on his proposed 6 basic emotions. The work that inspired my experiment is the Facial Action Coding System (FACS), a common standard to systematically categorize the physical expression of emotions. In essence, FACS can describe any emotional expression by deconstructing it into the specific Action Units (the fundamental actions of individual muscles or groups of muscles) that produced the expression. FACS has proven useful to psychologists and animators, and I believe most emotion detection systems adopt it. FACS is complex, and developing a system that uses it from scratch might take a long time. For my simple experiment, I identified two Action Units that are relatively easy to detect in still images: Lip Corner Puller, which draws the angle of the mouth superiorly and posteriorly (a smile), and Lip Corner Depressor, which is associated with frowning (and a sad face).

Fig. 1: A smile or joy, represented by the elevation of the corners of the mouth.
Fig. 2: Sadness, represented by a depression of the corners of the mouth.

To perform my experiment, I considered only two emotions, namely joy and sadness. To compare with the adapted MNIST networks, I created a single-rule algorithm as follows. Using dlib, a powerful toolkit containing machine learning algorithms, I detected the faces in each image with the included face detector. For any detected face, I used the included shape predictor to identify 68 facial landmarks. From all 68 landmarks, I identified the 12 corresponding to the outer lips.
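As a sketch of this detection step (hedged: it assumes dlib's pre-trained 68-point shape predictor file, whose path you must supply yourself), the code might look like the following. In the standard 68-point layout, the outer lips are landmarks 48 through 59:

```python
# In dlib's standard 68-point model, the outer lips are points 48-59.
OUTER_LIP_INDICES = list(range(48, 60))  # 12 points

def outer_lip_points(landmarks):
    """landmarks: sequence of 68 (x, y) tuples; returns the 12 outer-lip points."""
    return [landmarks[i] for i in OUTER_LIP_INDICES]

def detect_lip_landmarks(image_path,
                         predictor_path="shape_predictor_68_face_landmarks.dat"):
    """Detect faces in an image and return the outer-lip landmarks of each.
    Requires dlib and its pre-trained 68-point predictor file."""
    import dlib
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)
    img = dlib.load_rgb_image(image_path)
    results = []
    for face in detector(img, 1):  # upsample once, as in dlib's examples
        shape = predictor(img, face)
        pts = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
        results.append(outer_lip_points(pts))
    return results
```

The predictor filename above is the one dlib distributes, but treat the exact paths and parameters as assumptions rather than a drop-in recipe.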

Fig. 3: A face with 68 detected landmarks. White dots represent the outer lips.

Once I had the outer lips, I identified the topmost and the bottommost landmarks, as well as the landmarks for the corners of the mouth. You can think of these points as constructing a bounding box around the mouth.

Fig. 4: The topmost and bottommost landmarks in white, the corners of the lips in black.


Then the simple rule is as follows. I compute a mouth height (mh) as the difference between the y coordinates of the topmost and bottommost landmarks. I set a threshold (th) as half that height (th = mh/2). The threshold can be thought of as the y coordinate of a horizontal line dividing the bounding box into an upper and a lower region.

Fig. 5: A bounding box defined by the 4 special landmarks and a threshold line dividing it into two regions.

I then compute the two "lip corner heights" as the differences between the y coordinate of the topmost landmark and those of the two mouth corner landmarks. I take the maximum (max) of the "lip corner heights" and compare it to th. If max is smaller than the threshold, the corners of the lips are in the top region of the bounding box, which represents a smile (the Lip Corner Puller). If not, then we are in the presence of a Lip Corner Depressor action, which represents a sad face.
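The whole rule fits in a few lines. Here is a minimal sketch as a pure function over the four landmark y coordinates (remember that image y coordinates grow downward, so the topmost landmark has the smallest y):

```python
def classify_mouth(top_y, bottom_y, left_corner_y, right_corner_y):
    """Single-rule joy/sadness classifier from 4 outer-lip landmark
    y coordinates (image coordinates: top_y < bottom_y)."""
    mh = bottom_y - top_y          # mouth height
    th = mh / 2.0                  # threshold: half the bounding box
    # "Lip corner heights": vertical distance from the topmost landmark
    # down to each mouth corner; take the lower (larger) of the two.
    corner_max = max(left_corner_y - top_y, right_corner_y - top_y)
    # Corners strictly above the mid-line -> Lip Corner Puller (smile);
    # otherwise -> Lip Corner Depressor (sad face).
    return "joy" if corner_max < th else "sadness"
```

Note that a corner sitting exactly on the threshold line does not count as a smile, as in Fig. 6.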

Fig. 6: Lip corners exactly on the threshold line, which does not qualify as a smile.

With this simple algorithm in place, I then performed the experiment. For the MNIST networks, I extracted the relevant faces from the Kaggle set and ended up with 8989 joy and 6077 sad training faces. For testing, I had 224 and 212 faces respectively from the TFWM set. After training and testing, the simple MNIST network obtained 51.4% accuracy and the deep MNIST network 55%, a significant improvement over the 7-class version, but still very bad performance. I then ran the single-rule algorithm on the same test set. Surprisingly, this single rule obtained an accuracy of 76%, a 21-point improvement over the deep MNIST network.

There has been a long debate on whether Deep Learning algorithms are better than custom algorithms built on domain knowledge. Recently, Deep Learning has outperformed many such algorithms in Computer Vision and Speech Recognition. I have no doubt about the power of Deep Learning; however, much has been said about how difficult it is to build a good custom algorithm and how easy it is to build a good neural network. The single-rule algorithm I just described is very simple and far from a realistic system, yet this algorithm, built in an afternoon, beat something that took me over a month to understand. This is by no means a definitive answer to the debate, but it makes me wonder whether custom algorithms are ready to be replaced by their Deep Learning counterparts. Custom algorithms are not only good; as expressed in a previous post, they also give you the satisfaction of fully understanding what's going on inside, a priceless feeling.

This was an experiment born of desperation and curiosity; it was never meant to deny or criticize the power of Deep Learning. All opinions expressed were felt at that particular moment and may change over the course of my journey, in which I plan to build both a Deep Learning and a custom facial emotion expression classifier.

Figures 1 and 2 were obtained from the amazing online tool ARTNATOMY by Victoria Contreras Flores (Spain 2005).

Thoughts on Motivation and Self-Motivated Software


Raised by a non-traditional Latin mother, I learned from an early age how to help with the housework; I hated it nonetheless. My wife, a non-traditional Vietnamese woman, has made sure that the housework is almost equally distributed between the two of us. At home I am in charge of washing the dishes and clothes, and cleaning the floors and toilets. I love being useful almost as much as I hate the tasks I have to do. That's why every time I see my one-year-old son sporting a wide smile while using his bibs to clean his toys, walls, and basically everything in his way, I am shocked. How can he enjoy so much something that I have to push myself to do?

My wife quickly explained to me that in order to keep a clean environment in my son's daycare center, the teachers are always cleaning the walls, floors, and toys. One of the main ways in which babies learn is imitation, so it makes perfect sense that my son tries to replicate what the people taking care of him do. When anyone sees my son cleaning, they applaud and laugh, and he also bursts into laughter. He seems to enjoy all these smiles, which probably motivates him to keep cleaning. The same seems to be true of adults: we often behave in ways that tend to increase acceptance by our peers, usually perceived in the form of smiles and flattery.

It is said that state-of-the-art emotion detection technologies can now achieve human-like accuracy at detecting happiness and sadness from facial expressions, tone of voice, and the content of our words. I wonder why no one has tried to replicate the motivation shown by babies to create better software. Something as simple as letting a virtual home assistant automatically play music or TV shows that it has detected we like, by reading our smiles through a home camera or by what we tweet or post on Facebook. Is it really that difficult? Artificial Intelligence has already beaten the world champions at Chess, Jeopardy, and Go, so why can't it beat a one-year-old at showing empathy?

Don’t Feed Me, Teach Me How to Fish


“Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime,” reads the old proverb. After jumping on the Deep Learning bandwagon (but this applies to Big Data and any other bandwagon), I (or rather my Neural Nets) have been constantly fed, and it has been great. I have been especially enchanted by how Convolutional Neural Networks (CNNs, ConvNets) have performed at classifying handwritten numbers, detecting cats, and spotting facial landmarks. The problem is that the feast was already cooked and readily served on my (NNs') table. It was not my show; it was someone else's.

If you are an expert in Machine Learning you will read this and think “he finally got it”. If you are just beginning and still in the tutorials (there are many, for many frameworks/packages/libraries, all convincing you they are the best) then you probably still don’t get my feeling. You might say “but the X tutorial on Y package explained very well (and pointed to nice materials) how a CNN works” and you are right, but how about the datasets? Data are the fish for our CNNs (and our RNNs and any other NN), but the data was not just fished (or hunted) for us (and our NNs), it was already nicely cooked and seasoned (cleaned, labeled, formatted).

When people claim Deep Learning will bring Machine Learning to the masses, they are right, but except for a few experts and others sitting on large amounts of data, the masses will be largely playing with MNIST, CIFAR-10/100, and other datasets available through Kaggle competitions. You still don't know what I mean? I invite you to follow a TensorFlow tutorial (the same applies to any other library; they all use the same datasets), skip all the explanations, just copy and paste the code, then see the magic happen. After the ecstasy wears off, define a new task, not recognizing numbers or airplanes, but maybe detecting emotions in faces (like I did), and good luck my friend.

Deep Learning, like Big Data before it, is not bringing Machine Learning (or Data Science) to the masses in the right way. It feels more like politicians before elections, coming to the less fortunate with loads of food in order to win votes (this happens all the time in my country, Honduras). To really bring Machine Learning to the masses, it is necessary to teach us the nice tricks that top-notch scientists use to collect and prepare data. Things that come to mind, which I have done myself, include crawling data sources (Twitter, Instagram, Google, …), crowdsourcing labeling and cleaning tasks, and other more advanced techniques. To bring Machine Learning to the masses, stop feeding the masses; teach them how to fish.

This is not a critique of Deep Learning. I understand there are many people with some sort of experience dealing with data, and Deep Learning can become a very useful tool for them. The raison d'être of this article is that many are irresponsibly saying Deep Learning will bring Machine Learning to the masses, and the masses will come, and the masses will get disappointed, and suffer.

David and Goliath, or Me Trying to Beat Microsoft at Facial Emotion Detection


During my quest to build emotion detection from faces, I had always assumed there was no available system to compare my own with. Not because there are no companies or hobbyists already building such a thing, but because the companies making the headlines keep the details of their work in obscurity. Turns out I was wrong. Last November, Microsoft Research released, as part of their ongoing Project Oxford, a fantastic facial emotion detection demo, and I just found out. How could I miss it? I don't know, but guess what? I now have something to compare my own system with (after I build it).

So just how good is this MS system? It is darn good! For instance, I tried to use OpenCV to detect and crop the face in the image below to add it to my own dataset, but it was unable to detect a face (bad news for me, since I was planning to use OpenCV as part of my system). Have a look at what the MS system did:


It correctly detected the face and the expressed emotion. So it seems I will have a tough time trying to beat Microsoft. You may even think I am crazy to try. The thing is, this is extremely motivating and exciting. I want to see how far I can get. Whether I beat MS or not is not important; this is my training, my personal journey, and who knows, maybe once again David will defeat Goliath.

Show Me The Faces: Collecting Faces With Emotional Expression for Deep Learning


Teaching a machine to recognize objects (cars, houses, and cats) is a difficult task; teaching it to recognize emotions is another story. If you have been following my posts, you know that I want to teach machines to recognize human emotions. One important way in which machines can detect our feelings is by reading our faces. Teaching a machine to read faces has many challenges, and now that I have started to tackle this problem, I have encountered my first big one.

Deep Learning, a powerful tool used to teach machines, seems promising for the task at hand, but in order to make use of it I needed to find the materials to teach the machine. Let me use an analogy to explain. For humans to learn to recognize objects, or in our specific case facial expressions, a person has to be exposed to many faces. That's not a big deal, as we see faces everywhere from the second we are born. On the other hand, we don't really have tools to take a computer into the wild and let it learn. So my big challenge was finding pictures or videos of people showing emotions on their faces, to feed to the machine and let it learn.

Companies like Google and Facebook, and some big labs in prestigious universities, have access to enormous amounts of data (just think of how many faces people tag on Facebook). However, mere mortals like me have to find less straightforward ways to collect humble amounts of data to teach our machines. So let me start by defining exactly what data I wanted to collect. To teach my machine to recognize emotions from facial expressions, I needed to collect pictures of faces expressing some emotion (angry faces or happy faces), and at the same time I needed to explicitly tell the machine what the emotion is (this face shows anger). To be more exact, what I need to feed the machine is a collection of data pairs of the form [picture, emotion]. The question now is: how to obtain such data?

First, let me quickly tell you how you should not obtain this data. Many, including me, would first think about manually collecting thousands of pics from different sources (personal photos, Facebook, etc.), using a photo app to crop the faces (the learning is more efficient if the pic contains just the face), and manually defining the emotion tag. This is time-consuming and not scalable. Let me explain what I did instead. First, many companies offer automatic ways to pull data from their servers. The obvious choice for pics might then be Instagram (not Facebook, as its data is not public). The problem with Instagram is that it's not easy to specify that you want pics with faces. So in order to get exactly what I needed (faces with emotional expressions), my best choice was Google.
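Pulling from Google programmatically means its Custom Search API. A minimal sketch of building such an image query with the standard library follows; the key and search engine id are placeholders you would create in Google's developer console, and the exact parameter set is my reading of the API, so treat it as an assumption:

```python
from urllib.parse import urlencode

# Placeholders: create your own in the Google developer console.
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def image_search_url(query, start=1):
    """Build a Custom Search API request URL asking for image results."""
    params = {
        "key": API_KEY,
        "cx": CX,               # custom search engine id
        "q": query,             # e.g. "angry look" or "sad person"
        "searchType": "image",  # restrict results to images
        "start": start,         # pagination offset (rate limits apply)
    }
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)

url = image_search_url("angry look")
```

Each response page can then be fetched and its image links downloaded; the query string itself doubles as the emotion label for every pic it returns.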

Google offers the Custom Search API, a tool that lets programs pull data based on queries, much like humans do using the Google website. This was perfect for me; to understand why, try the query scared look on Google (then go to Images). So now I had an automatic way to get faces expressing emotions, and I did not have to manually identify the emotion (it comes from the query). But wait, what about this picture:

“Big Man With Angry Eyes Points His Gun To Your Face”, obtained using the query “angry look”


The image was obtained with the query angry look, and it clearly has an angry face in it, but it also has an upper body, a gun, and many watermarks. This is not good, as it will confuse my machine. How about this picture, obtained with the query sad person:

An image obtained with the query “sad person” but without a person in it.

It clearly has no sad person; it has no person at all, as it's just a table. So while in most cases (when using appropriate queries) you will obtain faces with the intended emotion (like the angry man), they will most likely come with extra noise, or sometimes there will be no face at all. Again, the best way to deal with this is not manually, but by using Computer Vision tools to remove the noise automatically.

After submitting many queries and downloading a few thousand pics (due to rate limitations, this might span a few days), I automatically processed all the pics using the popular Computer Vision library OpenCV (free, if you wonder). OpenCV comes pre-loaded with a set of nice classifiers to detect faces and other features (eyes, mouth, …) in pics. The results are magical:

Just an angry face thanks to OpenCV

OpenCV automatically detected a square region containing the face, and with additional commands, my program was able to automatically crop and reduce the face to a size and format appropriate to feed to my machine. Now, what happened to the image without the face? OpenCV did not detect any face in it, so it was automatically ignored. And voilà, that's how I could efficiently (and for free) start building a decent dataset of faces to later teach a computer how to detect our emotions.
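A hedged sketch of that detect-and-crop step, using OpenCV's bundled frontal-face Haar cascade (the cv2.data.haarcascades path assumes a modern opencv-python install, and the 48×48 output size matches the Kaggle set, both my choices rather than anything prescribed):

```python
def clamp_box(x, y, w, h, img_w, img_h):
    """Clip a detection box to the image bounds (defensive: Haar boxes
    are normally already inside the image)."""
    x, y = max(0, x), max(0, y)
    return x, y, min(w, img_w - x), min(h, img_h - y)

def crop_faces(image_path, out_size=48):
    """Return cropped, resized grayscale faces found in the image.
    Images with no detected face simply yield an empty list."""
    import cv2  # opencv-python
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:  # unreadable or corrupt download
        return []
    faces = []
    for (x, y, w, h) in cascade.detectMultiScale(
            img, scaleFactor=1.3, minNeighbors=5):
        x, y, w, h = clamp_box(x, y, w, h, img.shape[1], img.shape[0])
        faces.append(cv2.resize(img[y:y + h, x:x + w],
                                (out_size, out_size)))
    return faces
```

Returning an empty list for face-less images is exactly the "automatically ignored" behavior described above: those downloads contribute nothing to the dataset.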

To conclude, very often (depending on the query) you will find friends like this in the pictures:

Often you will encounter cartoon faces.

and OpenCV will of course return you this beauty:


Whether this is bad or not for the trained machine, I still don't know. I will find out when I move to the training process. Worst case, I have to manually remove a few faces (and other possibly wrongly detected objects). Best case, I have a machine that can tell if my kids are watching happy cartoons.

Disclaimer: I don’t own any picture used in this article. Pictures will be removed if requested.


Deep Feelings About Deep Learning

So I want to build Artificial Emotional Intelligence (AEI), and I already wrote about a possible application to treat mental health problems. Even big guns like Apple Inc. are trying to build AEI (for some obscure reason). So the obvious step when you want to build something is to study and do research.

As much as I tried not to fall for the hype recently gained by Deep Learning, I could not resist exploring its promises. Let me quickly explain. In order to build real AEI, I wanted to start with the component that can understand our words. This belongs to the fields of Natural Language Processing (NLP) and Computational Linguistics (CL). Building powerful and useful NLP/CL systems is extremely challenging. It took me nearly 3 years to build a system that can guess your emotions from what you write, and its accuracy is far from perfect. The reason is that such systems are traditionally built using manually defined rules, features, and algorithms tailored to specific tasks.

Deep Learning, on the other hand, promises to replace handcrafted features with efficient algorithms able to "learn" the features automatically from some input data, saving you all the hard work. So yeah! When you think about it, it makes sense to want to give it a try. And so I did. First I studied the basics of Artificial Neural Networks using the awesome Coursera Machine Learning course. Then, to complement that knowledge, I read this great online book and checked these fantastic video tutorials. All that taught me to play with toy Deep Networks using code fully written by myself. When I was ready, I jumped to TensorFlow, a full-fledged Deep Learning software library, and followed its tutorials to train Deep Networks to classify handwritten characters. My reaction? A rush of elation followed by a bit of disappointment.

Don't get me wrong, Deep Learning is awesome. There is mathematical proof that, in theory, neural networks can solve any problem. The handwritten character classification tutorial, although simple, hints at that. Yet there is something about Deep Learning that leaves a sour taste in the mouth. During my previous research project, I always felt I was in control, and in most cases I could justify why things worked. With Deep Learning, it all felt like magic. Beyond the valid mathematical intuition, you can't really understand what's going on inside the black box that is the constructed network. Moreover, even state-of-the-art systems were constructed empirically, by testing different network architectures until finding the best performer, with little clue as to why it performs better.

So yes, Deep Learning can solve complex problems, and yes, it can save time and effort, but without a clear understanding of what is going on inside, it might lead to many frustrations along the way. As soon as I move past the tutorials and into developing the first part of my AEI, I will post more about my feelings towards Deep Learning.

Building an Emotional Artificial Brain — Apple Inc.


Apple just bought Emotient, a startup that uses Artificial Intelligence to read people's emotions by analyzing their facial expressions. In October 2015, it also bought VocalIQ, another startup that uses speech technology to teach machines to understand the way people speak. As usual, Apple did not disclose the reasons behind the acquisitions.

Why is Apple interested in such technologies? To me the reasons are clear. If you have interacted with Siri, Apple's virtual assistant for mobile devices, you have probably discovered how limited it is. My guess is that the company is trying to inject Siri with Artificial Emotional Intelligence (or some sort of Artificial Emotional Brain), in an attempt to make interactions with the system much more natural. The missing piece of the puzzle? Technology to understand not just our faces and tones, but the implicit emotions hidden in the meaning of our words. If you haven't yet, please have a look at my demo of Emotion Detection from Text.

You can read more about Apple's acquisitions here.

Building an Emotional Artificial Brain — Motivations

“An estimated one in five people in the U.S. have a diagnosable mental disorder.” and “… cost an estimated $467 billion in the U.S. in lost productivity and medical expenses ($2.5 trillion globally).” are some of the lines that can be read in an interesting article about Virtual Reality Therapy published in TechCrunch. Now, just like me, you will probably be shocked to learn that VR has been used to treat some types of mental disorders for decades. So why is it that most people have never heard of such a thing? I suspect there are two reasons.

First, VR has so far failed to enter mainstream popular tech. The main reason is that creating truly immersive virtual experiences requires an absurd amount of resources, as free-to-roam worlds need a higher level of detail than the best video games out there. Moreover, in order to run VR software and display such virtual worlds, incredibly powerful hardware is necessary. Even the Oculus Rift, the device that promises to bring decent VR to the masses, requires a high-end computer that many can't afford (or don't need).

Second, and most important to me, is that in the realm of mental health, VR has mostly been used to treat fears, phobias, and post-traumatic stress disorder (PTSD). To treat arachnophobia, for instance, the patient can be safely exposed to virtual spiders in a virtual room to help her overcome the problem. Special hardware can also be constructed to simulate a virtual airplane to help a patient overcome the fear of flying. Virtual “worlds at war” can be constructed to help veterans with PTSD. The complication is that, in order to treat other mental health disorders that require direct interaction with another human (like the trauma after being raped, or a child with autism), most VR therapies require the (often indirect) intervention of a trained therapist, and access to mental health professionals is scarce.

So this is one case where the need for the Emotional Artificial Brain becomes evident. If such technology existed, it could be incorporated into virtual therapists created using artificial intelligence. Such virtual humans could maintain conversations with the patient (trained using transcriptions of real sessions) and adapt the conversation (content and tone) based on the emotions expressed by the patient (tone, content, facial expressions). It could bring VR-based therapy even closer to the masses, and who knows, maybe in 5–10 years no one will be surprised to hear about VR being used to fight the global mental health crisis.

The TechCrunch article that inspired this writing is this: Virtual Reality Therapy: Treating The Global Mental Health Crisis

Building an Emotional Artificial Brain — The Beginning

In October last year, I finally got my long-awaited PhD degree. My research topic was Sentiment Analysis (SA), a sub-field of Artificial Intelligence and Natural Language Processing that seeks to identify the polarity of any given text. Put in simpler words, given any subjective text, SA seeks to tell whether the text is positive, negative, or neutral. The result of my long research is a set of short and poorly optimized algorithms that, when combined in a pre-defined order, yield a very simple emotions classifier. Yes, emotions, not sentiment, which means this classifier can guess (very often wrongly) which of hundreds of emotions a subjective text expresses.

After setting up a working demo of the emotion detection system, my excitement grew quickly. The next obvious step for me was to build a company and create consumer apps that use the technology. Every single person who heard my plan and tried the demo was excited too. Nonetheless, my plan failed. The reason is simple: I realized that a system that can simply guess an emotion from your text is a very crude representation of what an ideal empathetic system should be. So I moved on and started another, unrelated company, with the hope of one day reviving my previous dream.

Being a tech entrepreneur, I spend much of my time reading the latest tech news. Some of the most discussed trends in recent days are Virtual Reality, Smart everything (homes, cars, devices, etc.) and the rise (and fear) of AI. With every new article read came a new rush of hope, but the fundamental question still remained: How to apply my algorithms and knowledge to these trending areas? I think I have finally found an answer, and this is what this writing, and the ones that will follow are about.

During the last days of 2015, I decided to build an Emotional Brain, or an Artificial Amygdala to be more specific. This is half personal project, half an attempt to predict what will be one of the main components of any future tech humans interact with. Concretely, I will attempt to build a series of algorithms that together can listen to, read, and see us, understand our feelings, and reply or act accordingly. A true empathic system. I don't know if it will be a full-fledged conversational agent or just a simple component of a whole, like an operating system module. Whatever it turns out to be, I will be telling the story in a series of short articles with varying formats. Some might feel like short research papers, others like a story, and others maybe like a random dump of my mind.