⁃ An image is grabbed by the camera;
⁃ A first deep learning model detects the hand in the image and estimates the coordinates of the bounding box around it (done by retraining the TensorFlow Object Detection API on a hand detection dataset; you could also achieve this by building a custom deep learning model);
⁃ A second deep learning regression model takes the image inside the box and estimates the coordinates of all hand keypoints (achieved by transfer learning from a ResNet34 with a custom head).
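The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the repo's actual inference code: `detect_hand` and `predict_keypoints` are hypothetical stand-ins for the trained MobileNet_v2 detector and ResNet34 regressor, and the keypoint count (21, the usual hand-keypoint convention) is an assumption.

```python
import numpy as np

def detect_hand(image):
    # Placeholder for the detector model: returns a bounding box
    # (x_min, y_min, x_max, y_max) in pixel coordinates. Here it just
    # returns the central region of the frame for illustration.
    h, w = image.shape[:2]
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)

def predict_keypoints(crop):
    # Placeholder for the keypoint regressor: returns 21 keypoints as
    # (x, y) pairs normalized to [0, 1] within the cropped hand image.
    return np.full((21, 2), 0.5)

def run_pipeline(image):
    # Stage 1: locate the hand and crop the image to its bounding box.
    x_min, y_min, x_max, y_max = detect_hand(image)
    crop = image[y_min:y_max, x_min:x_max]
    # Stage 2: regress keypoints inside the crop.
    kps = predict_keypoints(crop)
    # Map normalized crop coordinates back to full-image pixels.
    kps_px = np.empty_like(kps)
    kps_px[:, 0] = x_min + kps[:, 0] * (x_max - x_min)
    kps_px[:, 1] = y_min + kps[:, 1] * (y_max - y_min)
    return kps_px

frame = np.zeros((480, 640, 3), dtype=np.uint8)
points = run_pipeline(frame)  # (21, 2) array of pixel coordinates
```

The key point is the coordinate mapping at the end: the regressor only ever sees the crop, so its outputs must be rescaled and offset by the detected box to land in full-image coordinates.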
The detector was obtained by retraining a TensorFlow object detection model (pre-trained on the COCO dataset) on a hand detection dataset. I picked MobileNet_v2 for speed. In case you are using the Open Images dataset, I wrote a custom script to convert the data to the required format.
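A minimal sketch of the kind of conversion such a script performs: Open Images stores boxes as normalized `XMin`/`XMax`/`YMin`/`YMax` per `ImageID`, while the TF Object Detection tooling commonly consumes a CSV of pixel coordinates (`filename, width, height, class, xmin, ymin, xmax, ymax`) before TFRecord generation. The image sizes and the exact output columns here are illustrative assumptions; see the repo's script for the real conversion.

```python
import csv
import io

def convert_rows(open_images_csv, image_sizes, label="hand"):
    """Convert Open Images-style normalized boxes to pixel-coordinate rows.

    image_sizes maps ImageID -> (width, height); in a real script these
    would be read from the image files themselves.
    """
    rows = []
    for r in csv.DictReader(io.StringIO(open_images_csv)):
        w, h = image_sizes[r["ImageID"]]
        rows.append({
            "filename": r["ImageID"] + ".jpg",
            "width": w,
            "height": h,
            "class": label,
            # Denormalize [0, 1] coordinates to pixels.
            "xmin": round(float(r["XMin"]) * w),
            "ymin": round(float(r["YMin"]) * h),
            "xmax": round(float(r["XMax"]) * w),
            "ymax": round(float(r["YMax"]) * h),
        })
    return rows

sample = "ImageID,XMin,XMax,YMin,YMax\nabc123,0.25,0.75,0.1,0.9\n"
converted = convert_rows(sample, {"abc123": (640, 480)})
```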
Fine-tuning a ResNet34 model with Fastai; the full code is in the notebook. If you use this code and would like to cite this work, use the following:
Rafik Rahoui: Hand keypoints detection using Convolutional Neural Networks, https://github.com/rafik-rahoui/Hand-keypoints-detection