Automated Video Events Detection and Classification using CNN-GRU Model

ABSTRACT

In the era of vast and continuous video content creation, manually identifying crucial events becomes a tedious and inefficient task. To address this challenge, we propose a CNN-GRU model that automatically detects and classifies significant events in videos. This model employs a ResNet50 Convolutional Neural Network (CNN) to extract visual features from video frames, followed by Gated Recurrent Units (GRUs) for temporal modelling and event recognition. By leveraging the sequential data handling capabilities of GRUs, our model captures temporal patterns across frames. We evaluate the model's performance using accuracy and F1-score metrics on the VIRAT dataset, containing 1,555 events across 12 event classes. Our approach achieves promising results, with an event classification accuracy of 75.22%.


INTRODUCTION
In recent decades, there has been a notable surge in the worldwide prevalence of surveillance cameras, commonly referred to as closed-circuit television (CCTV) systems. Given the extensive proliferation of these cameras, monitoring the copious amount of data they generate poses a formidable challenge for human operators (Farrington et al., 2007).
The rise of large outdoor events has created a need for automatic event recognition technologies. This capability is important for event management, security, and tourism. The field has advanced with machine learning and deep learning, and computer vision and acoustic analysis have improved outdoor event recognition (Xu, 2021).
An outdoor human event recognition system must fulfil multiple requirements. First, it must effectively manage the substantial quantities of video data obtained from numerous cameras. Second, it must accurately identify and classify a diverse array of human activities within outdoor settings. Finally, it must operate in real time, enabling the prompt delivery of alerts when necessary (Sreenu & Durai, 2019).
Researchers have used varied methods to recognize outdoor events, progressing from traditional methods through machine learning to deep learning (Kamel et al., 2018). Traditional video action recognition methods face significant challenges. They struggle to capture crucial long-range temporal patterns, hindering their ability to comprehend video dynamics. These approaches often fail to effectively incorporate temporal data, especially in untrimmed videos where actions occur briefly. Moreover, they are tailored for trimmed videos where actions occupy substantial time, making them less adept at detecting actions in realistic scenarios where actions are fleeting [(Wang et al., 2016), 8].
Handcrafted features are another limitation of traditional methods. While deep convolutional networks can autonomously learn meaningful visual representations, traditional techniques rely on manually designed features that may not encapsulate the intricate diversity of video actions [1].
Furthermore, these traditional approaches are computationally intensive, storage-demanding, and lack end-to-end trainability. They rely on a two-stage process involving pre-computed motion information, unlike deep convolutional networks that can learn motion representations directly from video frames [2].
Deep convolutional networks, along with newer strategies, offer promising solutions to address these limitations and attain state-of-the-art action recognition performance. To counter the challenge of capturing long-term temporal dynamics, 3D spatio-temporal convolutions have emerged as a natural extension of 2D CNNs for videos. However, most current 3D CNN methods focus solely on RGB input, ignoring valuable optical flow and depth data, limiting their ability to harness multimodal information [3,4].
The need for substantial training datasets is another constraint of CNN algorithms [5,6]. Although CNNs excel in still-image recognition with large datasets, video-based action detection suffers from the scarcity of extensive video datasets, potentially leading to overfitting and reduced model generalization [7].
CNN approaches for video action detection were initially challenged in capturing long-term temporal connections, fully leveraging multimodal information, and effectively utilizing extensive training datasets. However, the introduction of Recurrent Neural Networks (RNNs) has effectively addressed these limitations, facilitating the advancement of video-based action recognition.
RNNs and their variants, notably Long Short-Term Memory (LSTM), have made noteworthy advances in temporal modelling for human activity identification [8]. RNNs possess the inherent capability to retain and utilize preceding information within a sequence [9,10]. LSTM networks specifically tackle the issue of vanishing gradients, enabling improved representation of extended temporal relationships [9,11]. Nevertheless, the computational complexity of LSTM poses a significant obstacle, particularly when dealing with lengthy sequences and high-dimensional visual input [9,12]. In response, an alternative form of RNN known as the Gated Recurrent Unit (GRU) has been proposed [13]. The GRU presents a more streamlined option that mitigates the computational load while effectively capturing extended interdependence [9].

Based on the preceding explanation, we present a novel model that addresses the aforementioned challenges. Our approach involves extracting features from the frames using a Convolutional Neural Network (CNN) with a customized layer arrangement. Subsequently, we employ a Gated Recurrent Unit (GRU) to capture the long-term dependencies in the data while minimizing computational complexity.

RELATED WORKS
Sultani et al. [14] propose an approach to learning anomalies in surveillance videos using weakly labelled training videos and a deep multiple-instance ranking framework. During training, the method incorporates sparsity and temporal smoothness constraints into the ranking loss function in order to enhance the localization of anomalies. The method uses C3D for feature extraction; while C3D features offer significant potential for video analysis, they come with certain drawbacks: their computational demands are high, leading to prolonged processing times, and they require substantial memory due to their 3D convolutional nature.
Jaouedi et al. [15] introduced two methodologies for human action recognition. The first, motion detection and tracking, uses a Gaussian Mixture Model (GMM) and a Kalman filter: it identifies human motion in every video frame and tracks it with the Kalman filter. The second uses deep-learning recurrent neural networks (RNNs) to retain a conceptual state and understand movement; Gated Recurrent Units (GRUs) train the recurrent models to extract dataset features. The two are then integrated to improve performance on a larger dataset. Motion detection and tracking using GMM and Kalman filters work poorly in videos with cluttered backgrounds, because background objects and movements can interfere with human action detection and tracking.

Amin Ullah et al. [23] presented a framework for the identification and classification of activities in surveillance footage obtained from industrial environments. The surveillance video stream is initially segmented into significant shots, with shot selection performed by a convolutional neural network (CNN) that incorporates human saliency attributes. Subsequently, the convolutional layers of a FlowNet2 CNN model are employed to extract temporal aspects of an activity within the sequence of frames. A multilayer LSTM model is proposed to capture and learn long-term sequences within temporal optical flow data, with the ultimate goal of enhancing activity recognition. This approach entails substantial computational complexity and requires substantial resources for its execution.
J.-O. Jeong [24] proposed a hybrid SlowFast network-YOLO technique to recognize human activity in surveillance videos. The SlowFast network was trained on annotated actions, and YOLO's object detection was used to identify and locate activities in surveillance videos. The intricacy of video surveillance with varied camera perspectives makes accuracy challenging due to class disparities and the small scale of human subjects. The work uses both models to improve video activity recognition by tackling dataset variation and precise localization, and training with pre-trained weights improves convergence speed. However, the mean average precision (mAP) achieved on the validation set is relatively low (around 0.1); class imbalances and small human subjects in the videos could contribute to this.
Hayat Ullah et al. [25] presented a spatial-temporal cascaded framework for human activity recognition. The framework is designed to be computationally efficient and versatile, utilizing deep discriminative spatial and temporal data. The proposed CNN architecture employs a dual attention mechanism that combines channel and spatial attention to capture significant human-centric features from video frames, thereby representing human behaviours. The convolutional and dual channel-spatial attention layers prioritize spatial receptive fields that contain objects within feature maps. Stacked bi-directional gated recurrent units (Bi-GRUs) combine forward and backward pass gradient learning to model long-term temporal patterns and recognize human actions by leveraging discriminative salient information.

VIRAT VIDEO DATASET
The VIRAT Video Dataset Release 2.0_VIRAT Ground [26] is a sizable video dataset gathered for research on object detection, object tracking, and event recognition. The VIRAT Ground Dataset is a subset of the VIRAT Video Dataset that comprises a collection of ground-truth annotated video sequences for a variety of applications, including event identification, object tracking, and object detection.
This dataset consists of videos taken by stationary, high-definition cameras (1080p or 720p) in eleven different scenes. The scenes are made up of multiple video clips, each of which may contain one or more instances of events from eleven different categories. There are eleven videos in this collection, broken up into 329 individual video snippets. The categories of events and their corresponding shortcodes are displayed in Table 1. The 1,555 annotated event clips, captured from a variety of sensors and cameras and encompassing a wide range of outdoor settings, are included in the VIRAT Ground Dataset. Annotations for events are included in the dataset, which is helpful for creating and assessing computer vision systems. Samples from the VIRAT Video Dataset Release 2.0_VIRAT Ground Dataset are shown in Figure 1 [27].

PROPOSED METHODOLOGY
This section provides a detailed analysis of our proposed event recognition technique and its component parts. To aid understanding, the suggested approach is divided into two independent parts, each of which is covered separately. First, features are extracted from input videos using a 2D CNN architecture. The second primary component is the RNN-GRU network. The benefit of using a GRU for recognizing events from videos lies in its ability to extract the sequential data associated with events. Figure 2 illustrates the conceptual workflow of our suggested approach.

PREPROCESSING
Working with the VIRAT Video Dataset Release 2.0_VIRAT Ground Dataset necessitates obtaining the dataset files from a credible source, ascertaining the format of the dataset, and loading the dataset by means of the applicable technique.

Once the dataset has been loaded into memory, several data analysis and machine learning libraries are used to handle and analyse the data. Python tools such as NumPy, Pandas, and Scikit-learn may be utilised for preparing and analysing data. The pre-processing procedures are covered in the sections that follow.

Resizing
The frames of each video are resized to a standard size of 224x224 [24]. This guarantees that every frame in the video has the same size and lowers the computing demands of the subsequent histogram equalization algorithm.
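To illustrate, a single frame can be mapped to the standard 224x224 shape with a nearest-neighbour index lookup. In practice a library routine such as OpenCV's cv2.resize would be used; this pure-NumPy sketch only illustrates the operation:

```python
import numpy as np

def resize_frame(frame, size=(224, 224)):
    """Nearest-neighbour resize of one frame to a fixed size.
    Each output pixel samples the proportionally nearest input pixel."""
    h, w = frame.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # source row for each output row
    cols = np.arange(size[1]) * w // size[1]   # source column for each output column
    return frame[rows[:, None], cols]

frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)  # a 720p frame
print(resize_frame(frame).shape)  # (224, 224, 3)
```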

Convert to Gray
Converting an RGB (Red, Green, Blue) image to a grayscale image is a common image processing task that involves representing the image using only shades of grey rather than full color.There are various methods for converting RGB to grayscale, but one of the most common methods is the luminosity method.
Using the luminosity approach, every RGB pixel in an image is transformed into a grayscale value according to how much each channel contributes to the overall perceived brightness. The formula for converting RGB to grayscale using the luminosity method is:

Gray = 0.299 R + 0.587 G + 0.114 B (1)

This formula reflects the fact that the human visual system is more sensitive to green light than to red or blue light. The coefficients are approximate and can be adjusted based on the specific requirements of the application. To convert an RGB video to grayscale using the luminosity method, the formula is applied to each frame of the video.
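The per-pixel weighted sum can be vectorized over a whole frame. The weights below are the common ITU-R BT.601 luminosity coefficients, taken as an assumption here since the text notes the coefficients may be adjusted:

```python
import numpy as np

def rgb_to_gray(frame):
    """Luminosity method: weighted sum of the R, G, B channels.
    Weights are the common ITU-R BT.601 coefficients (an assumption)."""
    weights = np.array([0.299, 0.587, 0.114])
    return (frame[..., :3].astype(np.float64) @ weights).astype(np.uint8)

red = np.zeros((2, 2, 3), dtype=np.uint8)
red[..., 0] = 255                      # pure red frame
print(rgb_to_gray(red)[0, 0])          # 76  (255 * 0.299, truncated)
```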

Histogram equalization
To achieve a more equitable dispersion of pixel intensities across the available dynamic range, histogram equalization is applied to the grayscale frame [7]. The first step is to calculate the histogram of the grayscale frame. The histogram represents the frequency of occurrence of each gray level in the image, as expressed in Eq. (2):

PMF_n = I_n / N (2)

where PMF_n is the normalized histogram of the gray-level frame f with one bin for each intensity n, I_n is the number of pixels with intensity n, N is the total number of pixels, and L is the number of gray levels (256 for 8-bit frames). Then, the cumulative distribution function (CDF) is calculated:

CDF_n = Σ_{j=0}^{n} PMF_j (3)

The CDF values are normalized to cover all gray levels (0 to L-1) by multiplying each CDF value by (L - 1), which yields the gray-level mapping function:

G_j = round((L - 1) × CDF_j) (4)

The variable G_j denotes the equalized gray level for the initial gray level j, and the round function rounds numerical values to their nearest integer. After applying this transformation to every pixel in the image, the intensity histogram of the image is more uniformly distributed. This can result in an image with better contrast and improved visual appearance. Figure 4 shows one sample frame before and after applying histogram equalization.
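The PMF, CDF, and mapping steps described above can be sketched directly in NumPy:

```python
import numpy as np

def equalize_histogram(gray, L=256):
    """Histogram equalization of an 8-bit grayscale frame:
    normalized histogram (PMF) -> cumulative sum (CDF) ->
    scale by (L - 1) and round to obtain the gray-level mapping."""
    hist = np.bincount(gray.ravel(), minlength=L)        # pixel count per level
    pmf = hist / gray.size                               # normalized histogram
    cdf = np.cumsum(pmf)                                 # cumulative distribution
    mapping = np.round((L - 1) * cdf).astype(np.uint8)   # equalized gray levels
    return mapping[gray]                                 # remap every pixel

# A dark, low-contrast frame gets stretched across the full range.
dark = np.random.randint(40, 60, (64, 64), dtype=np.uint8)
eq = equalize_histogram(dark)
print(dark.min(), dark.max(), eq.min(), eq.max())
```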

FEATURES EXTRACTION USING 2DCNNS
Convolutional Neural Networks (CNNs) are among the most commonly used methods for image recognition and feature extraction. Each input frame is processed to extract the relevant features. The CNN model was implemented using Keras, a free, open-source deep learning library written in Python. For each pre-processed frame, a sequence of convolutional layers employing filters (kernels), pooling, and fully connected (FC) layers is used to extract features. All parameters required to initialize the ResNet50 CNN model are outlined in Table 2, with random values assigned. We removed the model's last classification layer; rather than producing class probabilities, the model therefore outputs the features retrieved from the last convolutional layer.
In computer vision and deep learning, the use of 2D CNNs for feature extraction from videos is a commonly used method. CNNs have proven successful at processing images, and with small adjustments they can also handle videos. The main steps for using CNNs to extract features from videos are as follows. The first stage is pre-processing the video frames by resizing them to a set size and normalizing the pixel values; this guarantees consistency in the input data and lowers the CNN model's processing cost. Next, a 2D CNN model is applied to the pre-processed video frames to extract relevant features. The CNN comprises convolutional and pooling layers: convolutional layers filter input frames to extract spatial features, while pooling layers downsample feature maps to reduce their dimensionality. The extracted features are then used as input to the gated recurrent units (GRUs).
In a 2D CNN, the feature extraction layers are the convolutional layers and the pooling layers. These layers work together to learn hierarchical spatial features from the input data (video frames, in our case). Convolutional layers apply filters (also called kernels) to the input, capturing local patterns within the data. Deeper in the network, the convolutional layers learn increasingly complex and high-level features. Pooling layers, such as max-pooling or average-pooling, help reduce the spatial dimensions of the feature maps and control the number of parameters in the network, making the model more efficient and robust to overfitting.
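A minimal Keras sketch of such a per-frame feature extractor follows: ResNet50 with the classification head removed, so that each 224x224 frame yields a fixed-length feature vector. Here weights=None avoids downloading pretrained weights and the random input stands in for real frames; the paper's actual initialization is given in its Table 2:

```python
import numpy as np
from tensorflow.keras.applications import ResNet50

# ResNet50 without its top classification layer; global average pooling
# turns the last convolutional feature map into a 2048-d vector per frame.
extractor = ResNet50(weights=None, include_top=False, pooling="avg",
                     input_shape=(224, 224, 3))

frames = np.random.rand(4, 224, 224, 3).astype("float32")  # a 4-frame clip
features = extractor.predict(frames, verbose=0)
print(features.shape)  # (4, 2048)
```

The resulting per-frame vectors are what the GRU stage consumes as its input sequence.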

CLASSIFICATION USING GRUS
A sequence model using GRUs was implemented for video classification, the task of predicting the class label of a video based on its content. The model takes as input a sequence of features extracted from video frames by the 2D CNN and uses GRUs to model the temporal dynamics of the video by processing the sequence of features sequentially. The model has two input layers: frame_features_input (a sequence of features extracted from video frames) and mask_input (a binary mask that indicates which elements of the input sequence are valid and which are padded).

GRU Layers
The model contains two Gated Recurrent Unit (GRU) layers. The first layer has 16 units and is responsible for handling the input sequences while taking the binary mask into account. The second layer has 8 units and processes the outputs produced by the preceding layer.

Dropout Layer

After the second GRU layer, a dropout layer with a dropout rate of 0.4 is added to reduce the possibility of overfitting.

Fully Connected Layer

A fully connected layer with 8 units and a rectified linear unit (ReLU) activation function is added after the dropout layer.

Output Layer
The output layer is a dense layer that produces a probability distribution across the class labels using a softmax activation function. The number of units in this layer equals the total number of unique class labels in the dataset.

Compilation
The loss function, optimizer, and evaluation metric for training and evaluation are configured using the compile() method of the Keras model. The sparse_categorical_crossentropy loss function is employed, which is advantageous when there are several classes and the labels are integers. The Adam optimizer is a popular and efficient optimizer for gradient-based optimization. Accuracy serves as the measure of the model's performance.
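Under assumed sizes (2048-dimensional frame features, 20-frame sequences, 12 classes; the feature and sequence lengths are illustrative, not stated in the text), the classifier and its compilation can be sketched in Keras as:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 2048   # per-frame feature size (assumed, from ResNet50)
MAX_SEQ_LEN = 20      # frames per clip (illustrative)
NUM_CLASSES = 12      # event classes in the VIRAT dataset

frame_features_input = keras.Input((MAX_SEQ_LEN, NUM_FEATURES),
                                   name="frame_features_input")
mask_input = keras.Input((MAX_SEQ_LEN,), dtype="bool", name="mask_input")

# Two GRU layers (16 then 8 units); the mask skips padded timesteps.
x = layers.GRU(16, return_sequences=True)(frame_features_input, mask=mask_input)
x = layers.GRU(8)(x)
x = layers.Dropout(0.4)(x)                       # reduce overfitting
x = layers.Dense(8, activation="relu")(x)        # fully connected layer
output = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = keras.Model([frame_features_input, mask_input], output)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
print(model.output_shape)  # (None, 12)
```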
The model architecture is composed of two GRU layers with dropout and fully connected layers, followed by a softmax output layer. The model receives a collection of features extracted from video frames together with a binary mask indicating which items in the sequence are valid, and generates a probability distribution across the class labels. The model is compiled using a suitable loss function, optimizer, and evaluation metric for the given classification task. It is essential to place the results in the context of the specific problem and dataset. To determine potential pathways for improvement, further analysis and empirical study might be conducted; ideas include optimizing hyperparameters, exploring different architectures, or expanding the size of the training dataset. Ultimately, after assessing the different approaches used to pinpoint specific events in the VIRAT Video Dataset Release 2.0_VIRAT Ground Dataset, it was found that the suggested model produced more accurate results than the previously used approaches, as Table 6 illustrates. The model under consideration was compared with four established algorithms, as shown in Table 6. CAM [28], DHCM [29], ResNet50, and InceptionV3 have been trained on the large-scale ImageNet dataset to learn a rich set of visual features.

CONCLUSION AND FUTURE WORKS
This study presents a novel hybrid methodology for the identification and classification of noteworthy events in videos. The proposed approach leverages Convolutional Neural Networks (CNNs) for visual feature extraction and Gated Recurrent Units (GRUs) for temporal modelling and event recognition. The technique was evaluated using the VIRAT dataset, which consists of a wide range of events distributed across different categories. The empirical results show a good level of performance, with an accuracy of 75% in event categorization, demonstrating how well the suggested methodology works for finding noteworthy events in outdoor environments. In general, although the proposed model displays promise, additional investigation and enhancement are required to advance event recognition systems with regard to dataset expansion, transfer learning, multimodal fusion, real-time detection, robustness, and human-centric recognition.

FIGURE 1. Two samples from the VIRAT Video Dataset Release 2.0_VIRAT Ground Dataset.

FIGURE 2. Overview of the Proposed Model

FIGURE 3. Converting a colored frame into gray: a) Original colored frame b) Gray frame

FIGURE 4. The Gray frame and its histogram a) Original frame b) Equalized frame

FIGURE 5. The accuracy and loss of the proposed model

Table 1. The types of events and the shortcodes

Table 3. The proposed GRU layers

Table 5. Comparisons of the proposed model to other models