An old Soul Hackers Labs trick, now powered by Oak-D. People tracking and face detection happen on the edge device. The detected face is fed to SHL’s face analyzer (on the host) to determine the age, gender, emotions, attention level, and viewing time of the tracked person. Metadata is produced and stored to generate useful reports for advertisers.
Tracking and detecting faces was the most resource-intensive part, and now the host computer has been freed from this burden. Thanks, Oak-D!
One of the steps of teaching the Self-Driving Ads Robot (SEDRAD) how to navigate the environment is to teach it which surfaces are good for it to drive on. We began the tedious task of collecting images of the surroundings and segmenting them, to later “teach” SEDRAD to recognize the surface to follow and stay on. This process comprises image acquisition, image annotation (segmenting in our case), and then segmentation validation. It is very time consuming. We also ran into a problem: due to technical issues, the real SEDRAD was not available over the weekend when we collected the data. Instead, we mounted the cameras on a wagon, at the same height and separation from each other as on the robot. Annotation, here we go!
When we applied to the #OpenCV Spatial AI Competition #Oak2021, the very first issue we told the organizers we were going to solve using an Oak-D stereo camera was the inability of our robot to avoid obstacles located below the range of its 2D lidar. Back then we had no idea how we were going to do this, but we knew a stereo camera could help. In this video we present our solution. The video does not yet show it in action during autonomous navigation, but it explains how we will turn depth images from two front-facing Oak-D cameras into 2 virtual 2D lidars that can avoid obstacles located near the floor.
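To make the idea concrete, here is a minimal sketch of how a depth image could be collapsed into a lidar-like scan. It is not our production code, and it rests on assumptions that are not stated above: a pinhole camera mounted level (no pitch or roll), depth values in metres with 0 meaning "no reading", and hypothetical names and values (`depth_to_scan`, `min_h`, `max_h`, the toy intrinsics). For every image column it keeps the nearest point whose height above the floor falls inside a low band, so the floor itself is ignored while low obstacles are reported.

```python
import math

def depth_to_scan(depth, fx, fy, cx, cy, cam_height, min_h=0.03, max_h=0.30):
    """Collapse a depth image into one (angle, range) pair per column.

    depth      : list of rows, each a list of depths in metres (0 = no reading)
    fx, fy     : focal lengths in pixels; cx, cy: principal point (assumed pinhole model)
    cam_height : camera height above the floor in metres (camera assumed level)
    min_h/max_h: height band, in metres above the floor, treated as an obstacle
    """
    rows, cols = len(depth), len(depth[0])
    scan = []
    for u in range(cols):
        nearest = math.inf
        for v in range(rows):
            z = depth[v][u]
            if z <= 0:
                continue  # no valid depth for this pixel
            y_down = (v - cy) * z / fy      # metres below the optical axis
            height = cam_height - y_down    # metres above the floor
            if not (min_h < height < max_h):
                continue  # floor, or too high to matter for this virtual lidar
            x = (u - cx) * z / fx           # lateral offset in metres
            nearest = min(nearest, math.hypot(x, z))
        # The bearing of a column depends only on the intrinsics.
        angle = math.atan2(u - cx, fx)
        scan.append((angle, nearest))
    return scan

# Toy 5x3 depth image: one low obstacle in the centre column, one floor pixel on the left.
toy_depth = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0],
    [0.5, 0.0, 0.0],
]
scan = depth_to_scan(toy_depth, fx=2.0, fy=2.0, cx=1.0, cy=2.0, cam_height=0.5)
```

In the toy example the centre column reports an obstacle at 0.8 m straight ahead, while the left column sees only the floor and therefore reports an infinite range, which is how a lidar scan usually marks "nothing detected".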
We are pleased to announce that we are officially part of the second phase of the OpenCV AI Competition #Oak2021. Our team joins over 200 teams selected worldwide from among hundreds of participants. As a prize, OpenCV and Luxonis have awarded us a certificate and a free Oak-D camera (to join the 3 others we already owned) to help us develop our self-driving ads robot. Stay tuned for more updates.
When it comes to Taiwan and Southeast Asia, no accelerator is bigger or more impactful than AppWorks. With 275 startups accelerated, AppWorks has come to raise about US$222M. With their vast network of human resources, wanting to join them is a no-brainer.
We are happy to announce that starting July 29, 2016, we will be joining this prestigious institution as part of their batch #13. Soul Hackers Labs, together with over 30 other teams from Taiwan, Hong Kong, Macau, Malaysia, Singapore, and New York, will be spending the next 6 months building a sustainable business.
AppWorks will be providing us with office space, mentorship, connections, and access to computing resources (through partnerships with Amazon AWS and Microsoft BizSpark) to help us on our path to success. We are thrilled to see all the great things we will be building during this time. Please stay tuned for updates as the awesomeness takes place.
“I have been looking for some time for a camera to complement my smart home and I came to the conclusion that there is no product in the market that provides a decent solution for the user”, reads the introduction to a blog post I read the other day. This is particularly true of the Smart Home market, and any other market that deals with humans. Now, making a camera that solves all needs (human tracking, face recognition, people counting, etc …) is not straightforward, but that does not mean companies should not try to cover at least a small set of such desirable solutions. I believe there can be a demand for such things, but most people don’t know they need one yet.
The problem, it seems to me, is one of perception. While most people tend to consider a temperature or light sensor a simple piece of plug-and-play hardware that can be easily connected to a Smart Hub, camera solutions tend to be thought of as complex projects developed for a specific task and product. Why should this be the case? Many people, from Internet of Things (IoT) companies to makers, could benefit from an advanced plug-and-play “camera sensor”: one that could easily be plugged into your home hub, car, or any other Internet-enabled device and give access to its rich data (people ID, objects recognized, etc.) through its application programming interface (API).
Because I believe such things should exist, and because I believe once available many will benefit from it, I decided to create one such smart camera sensor. I am happy to introduce an early prototype of Project Jammin’s Face Sensor for the IoT, whose primary goal is to sense human emotions in real-time and without the need for cloud-based services. This sensor will offer the following functionalities and advantages out-of-the-box:
Facial emotion analysis
Offline and real-time processing
Small and affordable
Protect your privacy, no need to send video data to the cloud
Convenient API to collect detected emotions, faces, attention
Ability to build your own apps and systems with emotion sensing
This product, once it goes into production, will be suitable for retail, where it can be used to detect people’s reactions and attention to products. In education, such a sensor can be used to monitor kids and determine best study times, preferred topics, etc. Smart homes could benefit by adding emotion-based automation: just imagine if your home could adjust the lights, temperature, and music based on how you feel. Healthcare is another area in which this sensor could be useful. By placing it in front of sick patients or the elderly, one could monitor their recovery based on their emotions. The limit is your imagination. While emotion detection is not a new thing, I have not yet found an offering that just works, like the proximity, temperature, and other sensors that now proliferate.
Please find in the following video a demo of Project Jammin’s Face Sensor for the IoT:
Creating mood sensing technology has become very popular in recent years. There is a wide range of companies trying to detect your emotions from what you write, the tone of your voice, or from the expressions on your face. All of these companies offer their technology online through cloud-based programming interfaces (APIs).
As part of my offline emotion sensing hardware (Project Jammin), I have already built early prototypes of facial expression and speech content recognition for emotion detection. In this short article I describe the missing part, a voice tone analyzer.
In order to build a tone analyzer, it is necessary to study the properties of the speech waveform (a two dimensional representation of a sound). Waveforms are also known as time domain representations of sound as they are representations of changes in intensity over time. For more details about the waveform you can refer to this interesting page.
Using software specifically designed to analyze speech, the idea is to extract certain characteristics of the waveform that can be used as features to train a machine learning classifier. Given a collection of speech recordings, manually labelled with the emotion expressed, we can construct vector representations of each recording using the extracted features.
The features used in emotion detection from speech vary from work to work, and sometimes even depend on the language analyzed. In general, many research and applied works use a combination of pitch, Mel Frequency Cepstral Coefficients (MFCC), and formants of speech.
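As an illustration of turning a waveform into features, the sketch below estimates pitch by picking the strongest autocorrelation peak and pairs it with RMS energy to form a tiny feature vector. This is only a toy under stated assumptions: it is not the speech analysis software mentioned above, RMS energy is an extra feature I chose for the example, MFCCs and formants are left out, and all names (`estimate_pitch`, `rms_energy`, `feature_vector`) and the 80 to 400 Hz search range are hypothetical.

```python
import math

def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    lag_min = int(sample_rate / fmax)   # shortest period considered
    lag_max = int(sample_rate / fmin)   # longest period considered
    best_lag, best_score = lag_min, -math.inf
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation of the signal with itself shifted by `lag`.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def rms_energy(samples):
    """Root-mean-square amplitude, a rough measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def feature_vector(samples, sample_rate):
    """Vector representation of one recording: [pitch in Hz, RMS energy]."""
    return [estimate_pitch(samples, sample_rate), rms_energy(samples)]

# Synthetic stand-in for a recording: 50 ms of a 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
features = feature_vector(tone, sample_rate)
```

With labelled recordings, vectors like `features` (extended with MFCCs and formants) would be the rows fed to a classifier.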
Two components of “Project Jammin” are currently ready: a very basic facial expression detector, and an emotion classifier from text. Since this project is not meant to run on a phone or computer, but instead to be a component of any connected hardware (or robot), the big missing part was a speech-to-text interface. In the past few days I have been working to implement this missing part.
After some research, and an attempt to balance performance, speed, and low-resource consumption, I decided to use the popular Pocketsphinx library. This library has been widely used with low cost hardware like the Raspberry Pi (which I am actually using to build my prototype). The installation process was smooth, but once I tested it with the built-in language model, the performance was terrible. The tool could not recognize a single phrase I said correctly.
After some research, I found out how to create my own language models. Since I am interested in a system that can transcribe conversational utterances (as opposed to dictations, for instance), I decided to collect a chat log to create my model. After spending some time collecting data I was able to obtain a chat dataset with over 700k sentences. I then trained a trigram language model with a dictionary consisting of the 20k most frequent words. I was very excited: this chat log seemed big enough to cover most of the sentences we say in a regular conversation.
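For readers unfamiliar with trigram models, here is a minimal sketch of the two ingredients: a vocabulary limited to the most frequent words, and counts of three-word sequences from which conditional probabilities are read off. It is not the toolkit I used to build the actual model. The function names, the `<s>`, `</s>` and `<unk>` markers, and the three toy sentences are assumptions made for the example, and a real toolkit would add smoothing and back-off on top of these raw counts.

```python
from collections import Counter

def build_vocab(sentences, size):
    """Keep only the `size` most frequent words (20k in the post, 3 in the toy below)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w for w, _ in counts.most_common(size)}

def count_trigrams(sentences, vocab):
    """Count every three-word window, mapping out-of-vocabulary words to <unk>."""
    trigrams = Counter()
    for s in sentences:
        words = [w if w in vocab else "<unk>" for w in s.lower().split()]
        words = ["<s>", "<s>"] + words + ["</s>"]   # sentence boundary markers
        for i in range(len(words) - 2):
            trigrams[tuple(words[i:i + 3])] += 1
    return trigrams

def trigram_prob(trigrams, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2), with no smoothing."""
    context_total = sum(c for (a, b, _), c in trigrams.items() if (a, b) == (w1, w2))
    return trigrams[(w1, w2, w3)] / context_total if context_total else 0.0

toy_log = ["i am happy", "i am sad", "i am happy today"]
vocab = build_vocab(toy_log, size=3)
trigrams = count_trigrams(toy_log, vocab)
p_happy = trigram_prob(trigrams, "i", "am", "happy")
```

In the toy log, "i am" is followed by "happy" in two of three sentences, so `p_happy` comes out as 2/3, while the rarer "sad" falls outside the three-word vocabulary and is counted as `<unk>`.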
After running the code with the new model for the first time, my initial smile faded away quickly. Although the accuracy was much higher than with the built-in model, the transcribed text was always quite different from what I spoke into the mic. After tuning different parameters and testing over and over again, I never attained decent performance. Out of desperation, I decided to make one last test.
After manually inspecting the dataset, I noticed that there were many sentences that did not really matter in an emotion-detection context (e.g. “I will see you tomorrow”). With this in mind, I defined a small set of mood-related keywords (happy, afraid, …) as well as some words related to relationships (family, husband, …), and filtered out any sentence not containing at least one of the keywords. The result was a smaller dataset of about 5k sentences. Next I trained a new language model. The model was way smaller than the previous one, and only had around 3k unique words, but surprisingly, the recognition rate jumped dramatically.
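The filtering step can be sketched in a few lines. The keyword set below contains only the four words named above, since the full list is not reproduced in this post, and the helper name `keep_sentence` and the three sample sentences are made up for the example.

```python
import re

# Only the keywords named in the post; the real list was longer.
KEYWORDS = {"happy", "afraid", "family", "husband"}

def keep_sentence(sentence, keywords=KEYWORDS):
    """Return True if the sentence contains at least one keyword as a whole word."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return not words.isdisjoint(keywords)

chat_log = [
    "I will see you tomorrow",
    "I am so happy today!",
    "My husband is afraid of dogs",
]
filtered = [s for s in chat_log if keep_sentence(s)]
```

Matching on whole words rather than substrings avoids keeping a sentence just because, say, "unhappy" contains "happy"; whether such variants should count is a design choice left open here.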
Although this simple model is unable to detect every single phrase you say, it can recognize in near real-time many, if not most, emotion-loaded key phrases. Later I will combine this with other inputs like facial expression recognition (done) and tone of voice detection (future work). The idea is to have different weak detectors working together towards a more robust emotion classification. In the near future, when all the parts are working together, I will let you know if I was right or if I was just dreaming. Meanwhile, check out this video of my speech-to-text + textual emotion classification working together.
Recognizing emotions in facial expressions is relatively straightforward for humans, and in recent times machines have been getting better at it too. The applications of emotion-detecting computers are numerous: from improving advertising to treating depression, the possibilities are limitless. Motivated mainly by the impact that such technology can have on mental health, I started building my own emotion recognition technology.
In a previous post I described a quick test in which I used ideas drawn from research on how facial expressions are decomposed. In this simplified scenario a computer distinguished between sad and happy faces by detecting facial landmarks (points of eyes, mouth, etc …) and using one simple geometric feature of the mouth (representing a Lip Corner Puller). That single-rule algorithm was correct 76% of the time. As usual I got quickly overexcited and started defining other geometric features to improve and extend to six basic emotions (anger, disgust, fear, joy, sadness, and surprise).
To detect a Cheek Raiser, which basically closes the eyelids and is more obvious when we laugh, I used the ratio of the height to the width of the eyes. To detect an Inner Brow Raiser, which basically raises the inner brows and is characteristic of emotions like sadness, fear, and surprise, I computed the slope of a line crossing the landmarks representing the inner and outer brows.
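These two features are simple enough to sketch. The code below assumes landmarks are (x, y) pixel coordinates with y growing downward, as is usual for images. The function names and the sample coordinates are hypothetical, and which landmark indices to use depends on the landmark detector.

```python
import math

def eye_aspect_ratio(top, bottom, left, right):
    """Height-to-width ratio of one eye; it shrinks as the eyelids close."""
    return math.dist(top, bottom) / math.dist(left, right)

def brow_slope(inner, outer):
    """Slope of the line through the inner and outer brow landmarks.

    With y growing downward, a raised inner brow (smaller y) makes the slope
    more negative when the inner landmark lies to the right of the outer one.
    """
    dx = inner[0] - outer[0]
    if dx == 0:
        raise ValueError("inner and outer brow landmarks share the same x")
    return (inner[1] - outer[1]) / dx

# Made-up landmark coordinates for illustration.
ratio = eye_aspect_ratio(top=(10, 4), bottom=(10, 8), left=(4, 6), right=(16, 6))
slope = brow_slope(inner=(10, 5), outer=(0, 10))
```

A rule-based detector would then compare `ratio` and `slope` against thresholds, which is exactly the hand tuning that becomes unmanageable as more facial actions are added.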
As you can guess by now, manually identifying geometric features, to represent the nearly 20 actions necessary to perform the 6 basic emotions, got crazy hard pretty quickly. Not to mention that many were just impossible to define just using the landmarks (a Brow Lowerer just wrinkles the forehead). Even if I could successfully define them all, determining how to effectively combine them to detect an emotion would be just impossible by hand.
So I went back to machine learning, which essentially lets a machine learn how to efficiently combine features to classify or detect things. To make my life easier, instead of manually defining the geometric features, I decided to just feed the machine the lengths of the lines that make up a face mesh (as described here). The idea is that such lengths will vary from emotion to emotion as a representation of muscle contractions and extensions.
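A rough sketch of that feature extraction: given landmark coordinates and a list of mesh edges (pairs of landmark indices), the feature vector is just the length of every edge. The four-point "mesh" below is a toy, not the real face mesh, and the optional normalization by a reference edge is my own assumption for making features independent of face size, not something the approach above requires.

```python
import math

def mesh_lengths(landmarks, edges, reference=None):
    """Return the length of every mesh edge as a flat feature vector.

    landmarks : list of (x, y) points
    edges     : list of (i, j) index pairs into `landmarks`
    reference : optional (i, j) edge whose length is used to normalise the rest
    """
    lengths = [math.dist(landmarks[i], landmarks[j]) for i, j in edges]
    if reference is not None:
        scale = math.dist(landmarks[reference[0]], landmarks[reference[1]])
        lengths = [length / scale for length in lengths]
    return lengths

# Toy mesh: three edges over four points (a real face mesh has far more).
points = [(0, 0), (3, 0), (3, 4), (0, 4)]
toy_edges = [(0, 1), (1, 2), (0, 2)]
raw = mesh_lengths(points, toy_edges)
normalised = mesh_lengths(points, toy_edges, reference=(0, 1))
```

Each face then becomes one fixed-length vector, which is the form a classifier expects as input.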
Given such lengths, 178 to be more precise, a classifier can be trained to recognize different emotions. In my particular case I tried the popular Support Vector Machines (SVM) and a Logistic Regression (Logit) classifier, trained on around 20,000 low-res images (48×48 pixels). Logit gave better results across 3 completely different test sets. For the NimStim Face Stimulus Set (574 faces) it achieved 54% accuracy, for a subset of images crawled from Flickr user The Face We Make (850 faces) it achieved 55%, and for a set collected from Google Image Search and manually labeled by me (734 faces) it achieved 49% accuracy. The performance is not exactly human-like, and there are certainly systems that are far more accurate, but it is worth remembering that it only uses 178 features, and was trained in less than a minute on a laptop (as opposed to hours on multiple GPUs for state-of-the-art systems).
Finally, some papers I have surveyed mention that state-of-the-art accuracy can be achieved by combining geometric features with texture features. Texture features can be used to detect wrinkles on the forehead, nose, and other parts of the face resulting from certain facial expressions. In the near future I will learn how to extract such features and try them out.