Gesture recognition: from ADAS to sign language translation
Demand for gesture recognition is on the rise
Gesture recognition is a computer vision field that encompasses image and video processing algorithms for capturing, analyzing, and interpreting human body motion, mainly hand movements and facial expressions. Since we have already covered several concepts of facial expression analysis, such as face recognition, head tilt detection, and eye tracking, in our article on drowsiness detection, we will focus on hand gesture recognition from here on.
How we built an AI-based hand gesture recognition system
Our internal R&D team developed a camera-based hand gesture recognition system. Limited to 10 static hand gestures, it uses a custom CNN to analyze infrared images and video. Read on to find out how we achieved 97%+ recognition accuracy.
Step 1. Dataset creation
We created a dataset of infrared images of 10 distinct static hand gestures. The dataset contains 6 single-handed and 4 two-handed gestures performed by different people under varying conditions, e.g., at various angles and distances from the IR camera. We gathered around 1,000 samples per gesture and then applied data augmentation to the training set.
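To give a sense of what such augmentation can look like in code, here is a minimal sketch in PyTorch/torchvision; the framework choice, the transforms, and the parameter values are illustrative assumptions, not our exact pipeline:

```python
# Illustrative augmentation pipeline for single-channel IR frames.
# Transform choices and parameter values are assumptions for this sketch.
import torchvision.transforms as T

train_transforms = T.Compose([
    T.Grayscale(num_output_channels=1),         # IR frames are single-channel
    T.RandomRotation(degrees=15),               # hands at various angles
    T.RandomResizedCrop(96, scale=(0.8, 1.0)),  # various camera distances
    T.ColorJitter(brightness=0.3),              # varying IR illumination
    T.ToTensor(),
])
```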
Here are the hand gestures we selected for our AI model to recognize:
- OK or ring gesture
- ILY sign
- V sign
- Vulcan salutation
- Finger gun
- High five
- Fist bump
- Thumbs up
- Pray sign
- Hand heart
Step 2. AI model training
We built an appearance-based AI model that classifies the input visual information into 11 classes, treating the absence of any hand gesture as the 11th class. Only the left-frame images captured by the IR camera were taken into account; these so-called depth maps were used to train a custom-built CNN.
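For illustration, here is a minimal sketch of what such a network could look like. The article fixes only the input (single-channel depth frames) and the output (11 classes); everything else below, including the PyTorch framework, input resolution, and layer sizes, is an assumption:

```python
# A minimal sketch of a custom CNN for 11-way gesture classification.
# Layer sizes and the 96x96 input resolution are illustrative assumptions.
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, num_classes: int = 11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 96x96 -> 48x48
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 48x48 -> 24x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 24x24 -> 12x12
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),           # logits: 10 gestures + "none"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = GestureCNN()
logits = model(torch.randn(1, 1, 96, 96))  # one fake single-channel IR frame
assert logits.shape == (1, 11)
```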
Step 3. Real-life implementation
The resulting hand gesture recognition model can analyze both images and video. The only algorithmic difference is an additional post-processing step applied to the video stream to smooth out recognition results. We achieve this with a temporal filter, in our case a weighted average of the recognition results obtained for successive frames of the video sequence. This common technique is often used to remove noise from video or audio signals and to reduce errors. See for yourself how we reached an average of 97% hand gesture recognition accuracy on real-world data.
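The snippet below sketches one possible implementation of such a temporal filter: a sliding window that averages per-frame class probabilities with linearly increasing weights, so recent frames count more. The window length and weighting scheme are illustrative assumptions, not our production settings:

```python
# A minimal sketch of weighted temporal averaging over per-frame
# class probabilities. Window size and weights are assumptions.
from collections import deque
import numpy as np

class TemporalFilter:
    def __init__(self, window: int = 5):
        self.buffer = deque(maxlen=window)
        # Linearly increasing weights: recent frames count more.
        self.weights = np.arange(1, window + 1, dtype=np.float64)

    def update(self, probs: np.ndarray) -> int:
        """probs: per-class probabilities for the current frame."""
        self.buffer.append(probs)
        w = self.weights[-len(self.buffer):]          # weights for filled slots
        smoothed = np.average(np.stack(self.buffer), axis=0, weights=w)
        return int(np.argmax(smoothed))               # smoothed class prediction
```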
Exploring approaches to gesture recognition applications
We have mentioned that our hand gesture recognition model is appearance-based. But what does that mean? Let's take a closer look at the existing approaches to gesture recognition and the differences between them.
Shape-based model
Shape-based methods represent a hand by its contour or mask (inverted silhouette) and perform gesture recognition based solely on this information. These approaches lack the accuracy needed for effective gesture recognition, as they can derive only low-level features such as contour-to-area ratios, finger and palm length and thickness, and finger count.
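As a rough illustration, the sketch below derives a couple of such low-level features from a binary hand mask with OpenCV; the thresholds and the convexity-defect heuristic for counting fingers are assumptions:

```python
# Illustrative low-level contour features from a binary hand mask.
# The defect-depth threshold and finger heuristic are assumptions.
import cv2
import numpy as np

def shape_features(mask: np.ndarray) -> dict:
    """mask: binary (0/255) hand silhouette image."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return {}
    contour = max(contours, key=cv2.contourArea)      # largest blob = hand
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, closed=True)

    # Roughly count fingers as deep convexity defects between them.
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)
    fingers = 0
    if defects is not None:
        # Defect depths are fixed-point: divide by 256 to get pixels.
        fingers = int(np.sum(defects[:, 0, 3] / 256.0 > 20)) + 1
    return {"contour_area_ratio": perimeter / max(area, 1e-6),
            "fingers": fingers}
```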
Appearance-based model
Appearance-based models use only the general appearance of a hand. They extract the required features and compare them with features learned from training images, matching the unknown input image to the most similar known one. Appearance-based approaches employ machine learning models of varying complexity, and the final recognition accuracy depends mostly on the quality of the training dataset and the robustness of the ML model.
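A toy sketch of this idea: embed images as feature vectors and label an unknown image by its nearest training examples. HOG features and a k-NN classifier stand in here for whatever features and ML model a production system would use; the random data is a placeholder for real frames:

```python
# Illustrative appearance-based matching: HOG features + k-nearest
# neighbors. Feature type, k, and the data are all assumptions.
import numpy as np
from skimage.feature import hog
from sklearn.neighbors import KNeighborsClassifier

def embed(images: np.ndarray) -> np.ndarray:
    """images: (N, H, W) grayscale frames -> (N, D) HOG feature vectors."""
    return np.stack([hog(img, pixels_per_cell=(8, 8)) for img in images])

rng = np.random.default_rng(0)
train_images = rng.random((50, 64, 64))   # stand-in for real IR frames
train_labels = rng.integers(0, 10, 50)    # stand-in gesture labels

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(embed(train_images), train_labels)
print(knn.predict(embed(train_images[:2])))  # match to most similar examples
```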
Volumetric model
Model-based, or volumetric, gesture recognition algorithms approximate a hand with a 3D model and analyze its position and movements in three-dimensional space.
This method requires capturing information about the shape of the hand, often with stereo cameras or 3D sensors. Alternatively, the structure-from-motion photogrammetric technique can be applied to several 2D photos of the hand taken from different viewpoints. The 3D hand model is then computed from the resulting point cloud and used for further analysis.
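The core multi-view step can be illustrated with OpenCV's triangulation routine: given two calibrated views and matched 2D points of the same hand feature, it recovers the 3D points that make up the cloud. The projection matrices and point correspondences below are placeholder assumptions:

```python
# Illustrative two-view triangulation of hand points with OpenCV.
# Camera matrices and point correspondences are placeholder values.
import cv2
import numpy as np

# 3x4 projection matrices of two calibrated views: an identity pose
# and a second camera shifted along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# Matched 2D image points of the same hand features in each view (2xN).
pts1 = np.array([[0.40, 0.52], [0.31, 0.48]], dtype=np.float64).T
pts2 = np.array([[0.35, 0.52], [0.26, 0.48]], dtype=np.float64).T

# Triangulate into homogeneous coordinates, then normalize to 3D.
points_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4xN
cloud = (points_h[:3] / points_h[3]).T                 # Nx3 point cloud
print(cloud)  # 3D points that would feed the hand-model fit
```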
This approach is computationally expensive and thus rarely used for real-time gesture recognition. More commonly, it's applied in modern computer animation.
Skeletal-based model
Skeletal-based approaches detect the key points of a hand and compute its virtual skeleton. Around 20 key points can be detected on a hand, which is enough to accurately recognize both static and dynamic hand gestures. Skeletal representation is now becoming the most common way to perform hand gesture recognition.
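For example, Google's MediaPipe Hands, a widely used skeletal-based detector, returns 21 key points per detected hand. The sketch below (with a placeholder image path) shows how downstream gesture logic would receive the skeleton:

```python
# A minimal sketch using MediaPipe Hands, a popular skeletal-based
# detector. "hand.png" is a placeholder path, not a real asset.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
    image = cv2.imread("hand.png")
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for landmarks in results.multi_hand_landmarks:
            # 21 (x, y, z) key points forming the virtual hand skeleton
            skeleton = [(lm.x, lm.y, lm.z) for lm in landmarks.landmark]
            print(len(skeleton))  # -> 21
```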