Automated Video Events Detection and Classification using CNN-GRU Model


  • Sajjad H. Hendi, Informatics Institute for Postgraduate Studies, Iraqi Commission for Computers and Informatics, Iraq
  • Hazeem B. Taher, University of Thi-Qar, College of Education for Pure Sciences, Thi-Qar, Iraq
  • Karim Q. Hussein, Mustansiriyah University, College of Science, Computer Science Department, Baghdad, Iraq



In an era of vast and continuously growing video content, manually identifying significant events is tedious and inefficient. To address this challenge, we propose a CNN-GRU model that automatically detects and classifies significant events in videos. The model employs a ResNet50 Convolutional Neural Network (CNN) to extract visual features from video frames, followed by Gated Recurrent Units (GRUs) for temporal modeling and event recognition. By leveraging the sequential-data handling capabilities of GRUs, the model captures temporal patterns across frames. We evaluate the model's performance using accuracy and F1-score metrics on the VIRAT dataset, which contains 1,555 events across 12 event classes. Our approach achieves promising results, with an event classification accuracy of 75.22%.
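The pipeline the abstract describes, per-frame CNN features fed to a GRU whose final hidden state is classified over 12 event classes, can be sketched with a minimal NumPy GRU cell. This is an illustrative sketch only: the weight initialization, hidden size, and function names are assumptions, not the authors' implementation, and the 2048-dimensional inputs stand in for pooled ResNet50 features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state h_tilde."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        w = lambda *shape: rng.normal(0.0, 0.05, shape)  # small random init (illustrative)
        self.Wz, self.Uz, self.bz = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wr, self.Ur, self.br = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wh, self.Uh, self.bh = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim), np.zeros(hidden_dim)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)          # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)          # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)
        return (1.0 - z) * h + z * h_tilde                        # new hidden state

def classify_sequence(frame_features, cell, W_out, b_out):
    """Run the GRU over per-frame features; classify from the last hidden state."""
    h = np.zeros(W_out.shape[1])
    for x in frame_features:
        h = cell.step(x, h)
    logits = W_out @ h + b_out
    return int(np.argmax(logits))

# Illustrative shapes: 2048-dim frame features (as ResNet50 would yield after
# global pooling), 128 hidden units, 12 event classes as in the VIRAT setup.
rng = np.random.default_rng(1)
cell = GRUCell(input_dim=2048, hidden_dim=128)
W_out = rng.normal(0.0, 0.05, (12, 128))
b_out = np.zeros(12)
features = rng.normal(size=(30, 2048))  # stand-in for a 30-frame clip
pred = classify_sequence(features, cell, W_out, b_out)
print(pred)  # predicted class index in [0, 12)
```

In practice the weights would be learned end-to-end (or with the ResNet50 backbone frozen) rather than randomly initialized, but the recurrence above is the temporal-modeling core the paper relies on.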









How to Cite

Sajjad H. Hendi, Hazeem B. Taher, and Karim Q. Hussein, “Automated Video Events Detection and Classification using CNN-GRU Model”, WJCMS, vol. 2, no. 4, pp. 77–86, Dec. 2023, doi: 10.31185/wjcms.188.