Ji et al. propose to use 3D convolutional neural networks for action recognition. The approach, applying regular 3D convolutions in the architecture depicted in Figure 1 to 7 consecutive frames of size $60 \times 40$, is quite simple, but it is interesting that the paper was first published in 2010, two years before Krizhevsky's ground-breaking work on the ImageNet challenge and the revival of deep learning in computer vision. Still, the approach is quite limited: trained on inputs of size $7 \times 60 \times 40$ with 5 channels per frame, it is questionable whether the system could have been scaled to higher spatial or temporal resolution. The 5 feature channels per frame are gray scale, gradients in $x$ and $y$ direction, and optical flow in $x$ and $y$ direction. The number of parameters, $295{,}458$, is also considerably lower than that of AlexNet with its roughly $60$ million parameters.
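To make the input format concrete, here is a minimal NumPy sketch of a naive 3D convolution applied to a clip shaped like the paper's input (5 channels, 7 frames of $60 \times 40$). This is only an illustration of the operation, not the paper's actual architecture, which hardwires the channels and convolves the channel types separately; the kernel shapes and counts below are hypothetical.

```python
import numpy as np

def conv3d(volume, kernels):
    """Naive valid 3D convolution.

    volume:  (C, T, H, W) input clip
    kernels: (K, C, t, h, w) filter bank
    returns: (K, T-t+1, H-h+1, W-w+1) feature volumes
    """
    C, T, H, W = volume.shape
    K, _, t, h, w = kernels.shape
    out = np.zeros((K, T - t + 1, H - h + 1, W - w + 1))
    for k in range(K):
        for i in range(T - t + 1):
            for j in range(H - h + 1):
                for l in range(W - w + 1):
                    # Correlate the k-th kernel with the local 3D window.
                    out[k, i, j, l] = np.sum(
                        volume[:, i:i + t, j:j + h, l:l + w] * kernels[k])
    return out

# Input shaped like the paper's: 5 channels, 7 frames of 60x40.
clip = np.random.randn(5, 7, 60, 40)
# Two hypothetical kernels with 7x7 spatial and 3-frame temporal extent.
kernels = np.random.randn(2, 5, 3, 7, 7)
features = conv3d(clip, kernels)
print(features.shape)  # (2, 5, 54, 34)
```

Each kernel shrinks the temporal dimension from 7 to 5 and the spatial dimensions from $60 \times 40$ to $54 \times 34$, which illustrates why such small input resolutions leave little room for deep stacks of 3D convolutions.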
What is your opinion on the summarized work? Do you know related work that might be of interest? Let me know your thoughts in the comments below!