Molchanov et al. propose a 3D CNN for hand gesture recognition. The system consists of two networks, a high-resolution network and a low-resolution network, whose predictions are multiplied at test time. The architecture is illustrated in Figure 1.
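To make the fusion step concrete, here is a minimal sketch of multiplying the two networks' class predictions element-wise. The function name `fuse_predictions`, the example probabilities, and the renormalization step are illustrative assumptions, not the authors' code.

```python
def fuse_predictions(high_res_probs, low_res_probs):
    """Fuse two class-probability vectors by element-wise multiplication.

    Hypothetical helper: the inputs stand in for the per-class outputs of
    the high-resolution and low-resolution networks.
    """
    if len(high_res_probs) != len(low_res_probs):
        raise ValueError("both networks must predict the same set of classes")

    # Element-wise product of the two predictions.
    products = [h * l for h, l in zip(high_res_probs, low_res_probs)]

    total = sum(products)
    if total == 0:
        raise ValueError("the fused scores are all zero")

    # Renormalizing is an added convenience (an assumption of this sketch);
    # it does not change which class has the highest score.
    fused = [p / total for p in products]
    predicted_class = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted_class


# Made-up probabilities for three gesture classes.
high = [0.6, 0.3, 0.1]
low = [0.2, 0.7, 0.1]
fused, label = fuse_predictions(high, low)
```

With a product rule like this, a class only scores highly if both networks assign it a reasonably high probability.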
While the network architectures are quite simple, the authors perform thorough data augmentation during training. Fortunately, they detail both their training and data augmentation approaches. For data augmentation they use:
In experiments, they show that using depth information alone performs better than using intensity data alone. Still, the combination outperforms both. They also observe that including pre-computed gradients increases final performance.
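As a rough illustration of what pre-computing gradients and combining modalities can look like, the sketch below derives a gradient-magnitude channel from an intensity frame and stacks it with depth and intensity. The summary does not say which gradient operator or channel layout the authors use, so the central differences, the replicated borders, and the three-channel stacking are all assumptions, and the function names are hypothetical.

```python
def gradient_magnitude(image):
    """Per-pixel gradient magnitude of a 2D image given as a list of rows.

    Uses central differences with replicated borders (an assumption of
    this sketch, not necessarily the operator used in the paper).
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = image[r][max(c - 1, 0)]
            right = image[r][min(c + 1, cols - 1)]
            up = image[max(r - 1, 0)][c]
            down = image[min(r + 1, rows - 1)][c]
            gx = (right - left) / 2.0
            gy = (down - up) / 2.0
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out


def stack_channels(depth, intensity):
    """Build one (depth, intensity, gradient) tuple per pixel."""
    grad = gradient_magnitude(intensity)
    return [
        [(depth[r][c], intensity[r][c], grad[r][c]) for c in range(len(depth[0]))]
        for r in range(len(depth))
    ]


# Tiny made-up frames: a horizontal intensity ramp and a flat depth map.
intensity = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
depth = [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]
frame = stack_channels(depth, intensity)
```

Because the gradients are computed once ahead of training, the network receives edge information as an input channel rather than having to learn it from raw pixels.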
What is your opinion on the summarized work? Or do you know related work that is of interest? Let me know your thoughts in the comments below: