Crowd anomaly detection has become a challenging task in intelligent video surveillance and security systems. Intelligent video surveillance systems make extensive use of data mining, machine learning, and deep learning methods. In this paper, a novel deep learning approach is proposed to identify abnormal occurrences in crowded scenes. In this approach, an Adaptive GoogLeNet Neural Network Classifier combined with a Multi-Objective Whale Optimization Algorithm is applied to predict the abnormal video frames in crowded scenes. We use multiple instance learning (MIL) to dynamically develop a deep anomaly ranking framework. This technique predicts higher anomaly scores for abnormal video frames by treating regular and irregular videos as bags and video segments as instances. We use the multi-objective whale optimization algorithm to optimize the entire process and obtain the best results. Performance parameters such as accuracy, precision, recall, and F-score are used to evaluate the proposed technique in a Python simulation environment. Our simulation results show that the proposed method outperforms conventional methods on a public live video dataset.
Conventional video monitoring techniques depend on a human operator to monitor and control situations involving unexpected and abnormal occurrences. Hence, a lot of effort has gone into anomalous incident identification in video monitoring. Modern improvements have a positive effect on labor cost savings [
The necessity for automatic detection and segmentation of sequences of interest has therefore arisen. Current technology requires a significant amount of configuration work on every video feed before the video analysis phase can be deployed. These systems rely on predetermined heuristic rules, which complicates the detection approach and makes it harder to generalize to various surveillance scenarios [
The remainder of the article is structured as follows: Section 2 reviews the related works. Section 3 describes the proposed approach. Section 4 analyzes the behavior of the recommended approach and compares it with conventional methodologies. Finally, Section 5 concludes the paper.
In [
In surveillance videos, the authors in the paper [
Wang in his paper [
Chen in his paper [
Ma in his paper [
To identify abnormal occurrences, the authors in [
To solve the challenge of abnormal action recognition, Ding in his paper [
Poor contrast, noise, and the tiny size of the flaws may make finding individuals difficult. In order to measure the quality of detection, it is necessary to have complete information about the defect's geometry. In the paper [
Anomaly detection has grown in importance in the computer vision and pattern recognition fields in recent years. The primary problem is the wide variety of anomalous event settings. Defining a boundary that spans the range of potential anomalous occurrences is challenging. As a result, a typical solution is to define anomalous occurrences statistically as those that deviate from regular expectations and are not consistent with normal samples. Anomaly detection techniques may generally be split into two stages, namely event representation and the anomaly detection model. In event representation, relevant features are extracted from the video to depict the event. Owing to the ambiguity in event description, an event may be characterized by either object-level or pixel-level features. Motion history images and motion energy images are examples of the former, which utilize object trajectory and object appearance traits to represent an event. With object-level features, many objects occlude one another, which makes it tough to handle busy scenes. Pixel-level characteristics, such as the Spatio-Temporal Gradient (STG), Histograms of Optical Flow (HOF), and the Mixture of Dynamic Textures (MDT), are often derived from two-dimensional image blocks or three-dimensional video cubes.
A model for anomaly detection must be constructed once the event characteristics have been obtained. Detecting anomalies requires creating rules or models for everyday occurrences. When a test result deviates from the model or violates the rules, it is deemed an anomaly. Cluster-based detection models, state-inference detection models, and sparse-reconstruction detection models are some examples. The cluster-based detection approach, for instance, groups together normal events that are related in some way; during the testing phase, samples located far from the cluster centers are deemed anomalous. The state-inference model assumes that normal occurrences exhibit a consistent state transition over a longer duration. The fundamental concept of sparse-reconstruction detection models is that normal events have a small reconstruction error in comparison to abnormal occurrences. These approaches have shown promise in earlier research. However, there is a design flaw in such approaches, since the event representation and anomaly detection models are developed independently. These procedures require a great deal of study time and effort to develop individually, yet the techniques frequently fail, and their generalization ability is weak when the video content changes. Object identification, object detection, behavior recognition, and health diagnosis have all benefited greatly from the strong performance of deep learning methods. Since the two stages of feature representation and pattern recognition are intertwined, deep learning techniques are most successful when the two are optimized jointly to maximize model performance. This has the potential to enhance the method's generalizability in many situations. Researchers have therefore started to apply deep learning to abnormal event detection due to its effectiveness and efficiency.
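The cluster-based detection idea described above can be sketched in a few lines of numpy: features of normal training events are clustered, and a test sample lying far from every cluster center is flagged as anomalous. The feature vectors, cluster count, and distance threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fit_centers(normal_feats, k=2, iters=20):
    """Naive k-means over features of normal training events:
    deterministic farthest-point initialization, then Lloyd iterations."""
    centers = [normal_feats[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(normal_feats - c, axis=1) for c in centers], axis=0)
        centers.append(normal_feats[np.argmax(dists)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(normal_feats[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = normal_feats[labels == j].mean(axis=0)
    return centers

def is_anomalous(feat, centers, threshold):
    """A test sample far from every cluster centre is deemed anomalous."""
    return bool(np.min(np.linalg.norm(centers - feat, axis=1)) > threshold)

# toy 2-D "event features": two tight groups of normal events
rng = np.random.default_rng(1)
normal = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centers = fit_centers(normal)
print(is_anomalous(np.array([0.05, 0.0]), centers, threshold=1.0))  # False: near a centre
print(is_anomalous(np.array([10.0, 10.0]), centers, threshold=1.0))  # True: far from both
```

As the text notes, such a model is only as good as the hand-crafted features fed into it, which is what motivates the joint deep-learning approach.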
Hence, we propose a novel approach for abnormal event detection in crowded scenes using an Adaptive GoogLeNet Neural Network classifier. To further enhance the accuracy of the abnormal event detection, we employ a Multi-Objective Whale Optimization Algorithm. Our contributions in this work are:
A novel classification approach for the detection of abnormal events in crowded scenes using an Adaptive GoogLeNet Neural Network classifier. The integration of a multi-objective whale optimization algorithm for accurate detection of the abnormal frames classified by the classifier.
This section explains the flow of the proposed work. The schematic representation of the proposed work is depicted in
A histogram is a visual representation of the probability density function of a given type of data. An image histogram is a graphical depiction of the statistical distribution of grey values in a digitized image. The histogram determines the frequency of occurrence of the different grey values in the picture. The histogram of a digitized image with luminance levels in the interval [0, L − 1] is a discrete function. It is given by,
Here j = (t + 1), (t + 2),…, (J − 1) and
The transform function is expressed in terms of the cumulative distribution function:
The transform function (TF) of the image is given by:
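The histogram, cumulative distribution, and transform-function steps above can be sketched for an 8-bit greyscale image as follows (a minimal numpy sketch; the toy image values are illustrative):

```python
import numpy as np

def equalize(img, L=256):
    """Histogram equalisation: histogram -> CDF -> transform function."""
    hist = np.bincount(img.ravel(), minlength=L)     # frequency of each grey value
    pdf = hist / img.size                            # probability density
    cdf = np.cumsum(pdf)                             # cumulative distribution
    tf = np.round((L - 1) * cdf).astype(np.uint8)    # transform: s_k = (L-1) * CDF(r_k)
    return tf[img]                                   # remap every pixel

# toy low-contrast image: grey values squeezed into [100, 103]
img = np.array([[100, 100, 101, 102],
                [101, 103, 102, 100]], dtype=np.uint8)
out = equalize(img)
print(out.min(), out.max())   # grey levels spread across a much wider range
```

The remapped frame then has its grey levels spread across the available range, which improves contrast before the Gabor filtering stage.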
The above image, after applying the Transform Function (TF), is then processed through a Gabor filter for denoising, leading to the final enhanced image. Gabor filters are especially effective in representing and discriminating between different textures, and they exhibit optimal localization properties in both the spatial and frequency domains. Hence, they are used for motion analysis in abnormal event detection. The Gabor filter is effective for describing image energy and denoising because it utilizes frequency and orientation representations to differentiate and characterize image texture. The Gabor filter of the
Here, c indicates the scale parameter,
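A 2-D Gabor kernel of the kind described can be generated with numpy as follows (a minimal sketch; the size, scale, wavelength, and orientation values are illustrative assumptions, and in practice a bank of such kernels at several orientations and scales would be convolved with each frame):

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lam=6.0, psi=0.0):
    """Real part of a 2-D Gabor filter: Gaussian envelope times cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate axes by orientation theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * xr / lam + psi)
    return envelope * carrier

k = gabor_kernel()
print(k.shape)       # (15, 15)
print(k[7, 7])       # 1.0 at the centre: full envelope, zero carrier phase
```

OpenCV's `cv2.getGaborKernel` provides an equivalent kernel for use in a real pipeline.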
Current deep learning frameworks increase a neural network's efficacy by adding layers. The severe flaw of this idea is that the computational complexity increases dramatically as the network goes deeper. Google proposed the Inception architecture, known as GoogLeNet, in which the interior of the network is widened so that multiple correlated propagations are output in parallel. The heart of this architecture is built on the notion that obtaining diverse likelihood functions with significant correlation to the input data optimizes the outputs of every layer. In the fundamental Inception v1 component, the input data is fed into four distinct branches (1 × 1, 3 × 3, and 5 × 5 convolution units, and a 3 × 3 max-pooling unit), and the results are pooled into a unified feature map.
There are eleven layers in total in the proposed adaptive GoogLeNet neural network classifier: one input layer, four convolution layers, three pooling layers, one mapping layer, one fully connected layer, and one output layer. The convolutional units collect diverse spatial information from the input, while the max-pooling unit reduces the channels and size of the input to extract discrete characteristics. The Inception component is a means of compressing massive data into a small depth. Although the Inception architecture has since evolved to version 4, v1 has a simpler form; this method uses the v1 framework to build three CNN units, an activation unit, and a max-pooling unit. In this classifier, the abnormal and normal frames are passed through a number of layers, where they are verified and checked for abnormalities. The classifier identifies abnormal frames via the adaptive Inception unit, which detects abnormality in the video frames. The processing time of this classifier is very short, so abnormalities can be detected rapidly. A frame is classified as abnormal if it contains at least one abnormal pixel in the test sample. The architecture of the adaptive GoogLeNet Neural Network Classifier is shown in
Deep learning layers | Specifications
---|---
Size of the input | 200∼600
CNN unit | Filter count: 15; Kernel size: 5; Stride: 1; Padding: 0
Max-pooling unit | Pooling size: 2; Stride: 2; Padding: 0
Inception unit | Filter count: 3∼15; Kernel size: 1, 3, 5; Pooling size: 3 × 3; Stride: 1; Padding: 0
FC layer | 2 units, [50, 100] neurons
Output size | Five classes
Iteration | 10
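The parallel Inception-v1 branches described above can be sketched at the shape level with numpy: each branch produces a feature map over the same H × W grid, and the outputs are concatenated along the channel axis (a minimal sketch with illustrative channel counts; real branches are learned convolutions, stubbed here with random feature maps):

```python
import numpy as np

def inception_concat(x, branch_channels=(4, 8, 2, 1)):
    """Shape-level stand-in for an Inception v1 block: four parallel branches
    (1x1, 3x3, 5x5 convolutions and a 3x3 max-pool) each yield a feature map
    over the same H x W grid; the maps are concatenated along the channels."""
    h, w, _ = x.shape
    rng = np.random.default_rng(0)
    branches = [rng.normal(size=(h, w, c)) for c in branch_channels]  # placeholder outputs
    return np.concatenate(branches, axis=-1)

x = np.zeros((28, 28, 3))    # input feature map, H x W x C
y = inception_concat(x)
print(y.shape)               # (28, 28, 15): channel counts 4 + 8 + 2 + 1
```

This channel-wise pooling of branch outputs is what the text refers to as the results being merged into a unified feature map.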
The primary objective of including the whale optimization algorithm in the proposed work is to improve the accuracy of the abnormal event detection. The main principle of Multi-Objective Problems (MOPs) is presented in this section. MOPs are designed to minimize or maximize several competing objective functions. Consider the minimization problem with multiple functions fj(a), j = 1, 2,…, N (in which N represents the total count of objectives) as in (13) to derive the MOP:
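For such a minimization MOP, one candidate solution dominates another when it is no worse on every objective fj and strictly better on at least one. A minimal Pareto-dominance check (with toy objective vectors) looks like:

```python
def dominates(fa, fb):
    """True if objective vector fa Pareto-dominates fb (minimization):
    no worse on every objective and strictly better on at least one."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))

print(dominates([1.0, 2.0], [2.0, 3.0]))   # True: better on both objectives
print(dominates([1.0, 4.0], [2.0, 3.0]))   # False: a trade-off, neither dominates
```

The set of mutually non-dominated solutions forms the Pareto front that a multi-objective optimizer approximates.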
The Whale Optimization Algorithm (WOA) is a recent meta-heuristic algorithm that models the hunting behavior of humpback whales. The search in WOA begins with the generation of a random set of whales. The whales approach their prey using bubble-net or encircling techniques. In the encircling behavior, the whales adjust their position with respect to the best position found so far:
Here, r is a random vector in [0, 1], and the value of b is linearly reduced from 2 to 0 as the iterations proceed.
The bubble-net behavior is simulated using two methods. The first is shrinking encircling, which is achieved by lowering the value of b in
In this way, the whales can swim around their prey simultaneously within a shrinking circle and along a spiral path.
Moreover, the humpback whales also search for prey randomly. In this case, the position of a whale is updated by selecting a random search agent rather than the best search agent, as shown below:
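The encircling, spiral, and random-search update rules above can be sketched as a minimal single-objective WOA in Python, here minimizing a toy sphere function (the population size, iteration count, and bounds are illustrative assumptions; the paper's multi-objective variant additionally maintains a Pareto archive):

```python
import numpy as np

def woa(f, dim=2, whales=20, iters=200, lb=-5.0, ub=5.0, seed=0):
    """Minimal single-objective WOA for a minimisation problem."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lb, ub, (whales, dim))
    best = min(pos, key=f).copy()                 # best whale found so far
    for t in range(iters):
        b = 2.0 - 2.0 * t / iters                 # linearly decreases from 2 to 0
        for i in range(whales):
            r = rng.random(dim)
            A = 2.0 * b * r - b                   # coefficient vector in [-b, b]
            C = 2.0 * rng.random(dim)
            if rng.random() < 0.5:
                if np.all(np.abs(A) < 1.0):       # shrinking encircling (exploitation)
                    D = np.abs(C * best - pos[i])
                    pos[i] = best - A * D
                else:                             # random search agent (exploration)
                    rand = pos[rng.integers(whales)].copy()
                    D = np.abs(C * rand - pos[i])
                    pos[i] = rand - A * D
            else:                                 # spiral update around the prey
                l = rng.uniform(-1.0, 1.0)
                D = np.abs(best - pos[i])
                pos[i] = D * np.exp(l) * np.cos(2.0 * np.pi * l) + best
            pos[i] = np.clip(pos[i], lb, ub)
            if f(pos[i]) < f(best):
                best = pos[i].copy()
    return best

sphere = lambda x: float(np.sum(x * x))           # toy objective, minimum at the origin
best = woa(sphere)
print(round(sphere(best), 6))                     # small objective value near zero
```

In the proposed pipeline, the objective would instead score the classifier's detection quality rather than a sphere function.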
UMN dataset [
The proposed method is simulated using the Python simulation tool, and the behavioral metrics are analyzed. The suggested technique is compared with existing approaches based on performance metrics such as accuracy, precision, recall, and F-score. Criteria such as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are considered for the evaluation. True Positives (TP) are the pixels that the algorithm correctly detects as positive. True Negatives (TN) are the pixels that the system correctly detects as negative. False Positives (FP) are the pixels identified as positive that are not actually positive. False Negatives (FN) are the pixels identified as negative that are not actually negative.
Frame number | Accuracy | Precision | Recall | F-score |
---|---|---|---|---|
4 | 23.37% | 24.48% | 28.34% | 0.261553 |
8 | 33.37% | 34.48% | 32.34% | 0.361553 |
10 | 43.37% | 44.48% | 34.34% | 0.421553 |
15 | 48.37% | 46.48% | 36.34% | 0.481553 |
20 | 93.37% | 48.48% | 42.34% | 0.521553 |
25 | 52.37% | 52.48% | 46.34% | 0.561553 |
30 | 65.37% | 56.48% | 48.34% | 0.621553 |
35 | 68.37% | 58.48% | 52.34% | 0.641553 |
40 | 74.37% | 62.48% | 56.34% | 0.681553 |
45 | 73.37% | 64.48% | 62.34% | 0.721553 |
50 | 82.37% | 66.48% | 68.34% | 0.741553 |
65 | 84.37% | 68.48% | 74.34% | 0.761553 |
68 | 85.37% | 74.48% | 78.34% | 0.781553 |
74 | 86.37% | 78.48% | 82.34% | 0.821553 |
80 | 88.37% | 82.48% | 94.34% | 0.841553 |
84 | 93.37% | 83.19% | 98.34% | 0.861553 |
Accuracy determines the proportion of samples that were successfully detected; it measures how closely the outcome corresponds to the ground truth.
Precision refers to a model's ability to identify only the relevant objects. It is the proportion of positive predictions that are actually correct.
The ability of a system to identify every relevant object is known as recall. It is the proportion of all available ground truths that are correctly predicted as positive.
The F-score, also termed the F1-score, measures the effectiveness of a model on a given dataset. It is used to assess binary classification algorithms that label samples as either “Positive” or “Negative”. The F-score is defined as the harmonic mean of the precision and recall of the system, and thus integrates the two. The F-score for various frame numbers is shown in
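From the TP, TN, FP, and FN counts, the four metrics defined above can be computed as follows (a minimal sketch with toy counts, not values from the reported experiments):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f_score

acc, prec, rec, f1 = metrics(tp=80, tn=90, fp=10, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.85 0.889 0.8 0.842
```

Note that the F-score always lies between the precision and recall, closer to the smaller of the two.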
The above figure depicts various anomalous frames in which the anomaly in the crowded scene is detected.
In this paper, we have presented a new strategy for identifying anomalous occurrences in crowded scenes. The Adaptive GoogLeNet Neural Network Classifier uses Multiple Instance Learning (MIL) to dynamically develop a deep anomaly ranking framework, which predicts high anomaly scores for abnormal video frames. A multi-objective whale optimization algorithm is employed to obtain a more accurate determination of visual abnormalities. The experiments revealed that the suggested strategy outperforms conventional algorithms in detecting anomalous occurrences in crowded settings, based on the metrics measured on the UMN dataset. The proposed method gives better results than the existing approaches in terms of detection accuracy and processing time. Our future work is to incorporate contextual anomaly detection and localization in crowded scenarios, which will give more semantic and meaningful results; further improvements in performance and quality can be achieved with such enhancements to our model.