From ADAS to Sign Language Translation: How Gesture Recognition Disrupts Our World

Demand for Gesture Recognition is on the Rise
Gesture recognition is a computer vision field that encompasses a set of image and video processing algorithms that capture, analyze, and interpret human bodily motion, mainly hand movements and facial expressions. We have already addressed some of the concepts of facial expression analysis, such as face recognition, head tilt detection, and eye tracking, in our article on drowsiness detection, so from here on we will focus on hand gesture recognition.
How We Built AI-based Hand Gesture Recognition System
Our internal R&D team developed a camera-based hand gesture recognition system. Limited to 10 static hand gestures, it uses a custom CNN to analyze infrared images and video. Read on to find out how we achieved 97%+ recognition accuracy.
Step 1. Dataset Creation
We created a dataset of infrared images of 10 distinct static hand gestures. The dataset contains 6 single-handed gestures and 4 two-handed gestures performed by different people under different conditions, e.g. at various angles and distances to the IR camera. We gathered around 1,000 samples per gesture and then applied data augmentation to the training dataset.
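The article does not say which augmentations were applied, so the following is only a minimal sketch of what augmenting IR training images might look like. The function name `augment`, the choice of random shifts and brightness changes, and all parameter values are assumptions made for illustration, not the team's actual pipeline.

```python
import numpy as np

def augment(image, rng, max_shift=4, brightness_range=(0.8, 1.2)):
    """Return a randomly perturbed copy of a 2D grayscale image with values in [0, 1].

    Hypothetical augmentation: a small random translation plus a brightness gain.
    """
    out = image.astype(np.float32)
    h, w = out.shape

    # Random translation by a few pixels; uncovered borders are filled with zeros.
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    shifted = np.zeros_like(out)
    src = out[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src

    # Random brightness gain, clipped back into the valid intensity range.
    gain = rng.uniform(*brightness_range)
    return np.clip(shifted * gain, 0.0, 1.0)

# Usage: create several augmented variants of one (synthetic) training sample.
rng = np.random.default_rng(0)
sample = np.full((32, 32), 0.5, dtype=np.float32)
variants = [augment(sample, rng) for _ in range(5)]
```

Augmentation of this kind is applied to the training split only, so that validation and test images remain unmodified real-world captures.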
See the hand gestures we have selected to be recognized by our AI model:
OK or ring gesture | ILY sign | V sign | Vulcan salutation | Finger gun
High five | Fist bump | Thumbs up | Pray sign | Hand heart
Step 2. AI Model Training
We built an appearance-based AI model that classifies the visual input into 11 classes, with the absence of any hand gesture treated as the 11th class. Only the left-frame images captured by the IR camera were taken into account, and these so-called depth maps were used to train a custom-built CNN.
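The CNN architecture itself is not described here, so the sketch below covers only the final step of an 11-class classifier: turning raw network outputs (logits) into a gesture label. The class order in `GESTURE_CLASSES` and the helper names `softmax` and `decode` are assumptions for illustration.

```python
import numpy as np

# Assumed label order: the 10 gestures plus "No gesture" as the 11th class.
GESTURE_CLASSES = [
    "OK", "ILY", "V sign", "Vulcan salutation", "Finger gun",
    "High five", "Fist bump", "Thumbs up", "Pray sign", "Hand heart",
    "No gesture",
]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def decode(logits):
    """Return the most likely class label and its probability."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    idx = int(np.argmax(probs))
    return GESTURE_CLASSES[idx], float(probs[idx])

# Usage: a made-up output vector where the "No gesture" score dominates.
example_logits = np.zeros(len(GESTURE_CLASSES))
example_logits[10] = 5.0
label, confidence = decode(example_logits)
```

Modeling "no gesture" as its own class lets the network reject frames without a hand explicitly, instead of forcing every frame into one of the 10 gestures.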
Step 3. Real-life Implementation
The resulting hand gesture recognition model can analyze both images and video. The only difference in the algorithm is an additional post-processing step applied to the video stream to smooth out the recognition results. We achieve this by applying a temporal filter, which in our case is a weighted average of the recognition results obtained for successive frames of the video sequence. This common technique is often employed to remove noise from video or audio signals and reduce errors. See for yourself how we reached an average of 97% hand gesture recognition accuracy on real-world data.
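The temporal filter described above can be sketched as a weighted average over the class probabilities of the last few frames. The window length, the weights, and the class name `TemporalFilter` are assumptions; the article does not state the values actually used.

```python
from collections import deque

import numpy as np

class TemporalFilter:
    """Smooth per-frame class probabilities with a weighted moving average."""

    def __init__(self, weights=(0.1, 0.2, 0.3, 0.4)):
        # Assumed weights: the last entry applies to the most recent frame.
        self.weights = np.asarray(weights, dtype=np.float64)
        self.history = deque(maxlen=len(self.weights))

    def update(self, probs):
        """Add one frame's probabilities; return (smoothed class index, smoothed probabilities)."""
        self.history.append(np.asarray(probs, dtype=np.float64))
        frames = np.stack(self.history)              # shape: (n_frames, n_classes)
        w = self.weights[-len(self.history):]        # use only as many weights as frames seen
        smoothed = (w[:, None] * frames).sum(axis=0) / w.sum()
        return int(np.argmax(smoothed)), smoothed

# Usage: three confident frames for class 0, then one noisy frame favoring class 1.
flt = TemporalFilter()
for frame_probs in ([0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.4, 0.6]):
    predicted, smoothed_probs = flt.update(frame_probs)
```

In this example the single noisy frame would flip a per-frame prediction to class 1, while the filtered prediction stays at class 0 because the earlier frames still carry 60% of the total weight.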
Exploring Approaches to Hand Gesture Recognition
We have mentioned that our hand gesture recognition model is appearance-based. But what does that mean? Let’s dwell on the existing approaches to gesture recognition and the differences between them.
- Shape-based methods represent a hand by its contour or mask (inverted silhouette) and perform gesture recognition based solely on this information. As only low-level features, e.g. contour/area ratio, length and thickness of fingers and palm, or number of unfolded fingers, can be derived from such input, shape-based approaches are not accurate enough for reliable gesture recognition.
- Appearance-based models use only the general appearance of a hand. They extract features, e.g. corners, edges, or global descriptors, and compare them with previously learned training images to match the unknown input image with the most similar known one, then predict the hand gesture based on that similarity. Appearance-based approaches classify hand gestures with Machine Learning models of varying complexity, from kNN and decision trees to deep neural networks. The final recognition accuracy depends mostly on the quality of the training dataset and the robustness of the ML model.
- Model-based or volumetric gesture recognition algorithms approximate a hand with its 3D model to analyze its position and movements in three-dimensional space. This method requires capturing information about the shape of a hand, which is often achieved by using stereo cameras or 3D sensors. Another way of recovering the necessary three-dimensional information is to apply the Structure from Motion photogrammetric technique to several 2D photos of a hand captured from different views. The 3D hand model is computed from the resulting point cloud and is used for further analysis. As the model-based approach is computationally expensive, it is not used for real-time gesture recognition applications and is mostly applied in computer animation.
- Skeletal-based approaches detect the keypoints (joints) of a hand and compute its virtual skeleton. Around 20 keypoints can be detected on a hand, just enough to accurately recognize both static and dynamic hand gestures. Skeletal representation is now becoming the most common way to perform hand gesture recognition.
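To make the skeletal idea concrete, here is a minimal sketch that normalizes hand keypoints and matches them to the nearest stored template, which is effectively a 1-nearest-neighbor classifier. The keypoint layout (row 0 is the wrist), the toy three-point templates, and the function names are all assumptions for illustration; a real system would use the full set of detected keypoints and many examples per gesture.

```python
import numpy as np

def normalize_keypoints(keypoints):
    """Make (K, 2) keypoints invariant to hand position and size.

    Assumes row 0 is the wrist; it is moved to the origin and the farthest
    keypoint is scaled to unit distance.
    """
    pts = np.asarray(keypoints, dtype=np.float64)
    centered = pts - pts[0]
    scale = np.linalg.norm(centered, axis=1).max()
    if scale == 0:
        return centered.ravel()
    return (centered / scale).ravel()

def nearest_gesture(keypoints, templates):
    """Return the label of the template closest to the query keypoints."""
    query = normalize_keypoints(keypoints)
    distances = {
        label: np.linalg.norm(query - normalize_keypoints(template))
        for label, template in templates.items()
    }
    return min(distances, key=distances.get)

# Usage with toy 3-point "skeletons" (hypothetical data).
templates = {
    "open": [[0, 0], [0, 1], [0, 2]],
    "fist": [[0, 0], [0.5, 0.5], [0.2, 0.6]],
}
# The same "open" pose, three times larger and shifted in the image.
query = [[10, 5], [10, 8], [10, 11]]
result = nearest_gesture(query, templates)
```

This normalization removes translation and scale but not rotation, so a hand tilted relative to the templates would need either rotation alignment or templates recorded at several orientations.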