Pavlo Molchanov, Shalini Gupta, Kihwan Kim, Jan Kautz. Hand gesture recognition with 3D convolutional neural networks. CVPR Workshops, 2015.

Molchanov et al. propose a 3D CNN for hand gesture recognition. The system consists of two networks, a high-resolution network and a low-resolution network, whose class predictions are multiplied at test time. The architecture is illustrated in Figure 1.

Figure 1: The two employed networks, i.e. the high-resolution network (top) and the low-resolution network (bottom), including all necessary parameters.
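The test-time fusion amounts to an element-wise product of the two networks' class probabilities, followed by renormalization. A minimal sketch (function and variable names are my own, not from the paper):

```python
import numpy as np

def fuse_predictions(p_high, p_low):
    """Fuse the class probabilities of the high- and low-resolution
    networks by element-wise multiplication, then renormalize."""
    p = np.asarray(p_high, dtype=float) * np.asarray(p_low, dtype=float)
    return p / p.sum()

# Both networks mildly favor class 1; the product sharpens that decision.
p = fuse_predictions([0.2, 0.7, 0.1], [0.3, 0.6, 0.1])
```

Multiplying (rather than averaging) probabilities means a class only survives if both networks assign it non-negligible mass, which tends to suppress disagreements.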

While the network architectures are quite simple, the authors perform thorough data augmentation during training and, fortunately, describe both training and augmentation in detail. For data augmentation they use:

  • reverse ordering of the frames and horizontal mirroring (computed offline; the remaining augmentations are applied online during training);
  • spatial rotation, scaling and translation;
  • spatial elastic deformation;
  • fixed-pattern dropout, i.e. setting the same (but randomly selected) pixels across all frames to zero;
  • random dropout;
  • temporal scaling (of duration) and translation;
  • temporal elastic deformation (elastic deformation extended to the temporal domain, see the paper for details).
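A few of these augmentations are easy to sketch on a clip stored as a `(frames, height, width)` array. The snippet below is my own illustration, not the authors' code; the dropout probability and the nearest-neighbor resampling for temporal scaling are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_and_mirror(clip):
    """Reverse frame order and mirror horizontally (the offline augmentation)."""
    return clip[::-1, :, ::-1]

def fixed_pattern_dropout(clip, drop_prob=0.1):
    """Zero the same randomly selected pixels in every frame:
    one spatial mask is drawn and broadcast over the frame axis."""
    mask = rng.random(clip.shape[1:]) >= drop_prob  # shape (height, width)
    return clip * mask

def temporal_scale(clip, factor):
    """Stretch or compress the clip in time by resampling frame indices
    (nearest-neighbor, a simple illustrative choice)."""
    n = max(1, int(round(clip.shape[0] * factor)))
    idx = np.round(np.linspace(0, clip.shape[0] - 1, n)).astype(int)
    return clip[idx]
```

Reversal and mirroring double the data offline, while the online augmentations produce a different variant of each clip in every epoch.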

In experiments, they show that using depth information alone performs better than using intensity data alone; still, the combination of both outperforms either modality. They also observe that including pre-computed gradients further increases performance.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.