Improved Anomaly Detection in Surveillance Videos with Multiple Probabilistic Models Inference

Anomaly detection in surveillance videos is an extremely challenging task due to the ambiguous definition of abnormality. In a complex surveillance scenario, many kinds of abnormal events may co-exist, including appearance and motion anomalies of objects, long-term abnormal activities, etc. Traditional video anomaly detection methods cannot detect all of these kinds of abnormal events. Hence, we utilize multiple probabilistic models inference to detect as many different kinds of abnormal events as possible. To depict realistic events in a scene, the parameters of our method are tailored to the characteristics of video sequences from practical surveillance scenarios. Moreover, there is a lack of video anomaly detection methods suitable for real-time processing, and the trade-off between detection accuracy and computational complexity has received little attention. To reduce computational complexity and shorten frame processing times, we employ a variable-sized cell structure and extract a compact feature set from a limited number of video volumes during the feature extraction stage. Building on these ideas, we propose a real-time video anomaly detection algorithm called MPI-VAD that combines the advantages of multiple probabilistic models inference. Experimental results on three publicly available datasets show that the proposed method attains competitive detection accuracy and superior frame processing speed.


Introduction
The detection of abnormal events in surveillance videos is a significant task because watching videos frame by frame manually is very time-consuming. The availability of large volumes of surveillance video creates a great demand for automated processing. However, this can be extremely challenging due to the uniqueness and unbounded nature of abnormal events in the real world. Besides, as it is infeasible to enumerate all kinds of abnormal events, we are unable to find a sufficiently representative set of anomalies. Based on the characteristics of the labeled data in the training set, video anomaly detection can typically be classified into three categories: supervised [1], where both normal and abnormal samples are labeled; semi-supervised [2][3][4][5][6], where only normal samples are provided; and unsupervised [7,8], where no training data is given. We aim to tackle semi-supervised video anomaly detection, where only normal samples are required in the training set. An intuitive approach is to model the normality distribution of the training data; any sample that does not adhere to this distribution is identified as abnormal.
Probabilistic models are used for statistical data analysis, e.g., path planning [9]. They are also widely used for establishing the normality distribution of training data, such as Markov random fields [5], conditional random fields [6], probabilistic event logic [10], and local statistical aggregates [11]. The problem formulation of video anomaly detection based on multiple probabilistic models inference is as follows: (1) In the model training stage, given the training data X_train = {x_1, x_2, …, x_n} containing only normal video samples, the goal is to build a set of probabilistic models p_X = {p_X1, p_X2, …, p_Xm} of normal event patterns from X_train. (2) In the detection stage, the testing data X_test contains both normal and abnormal video samples, in which the samples that do not conform to the probabilistic models p_Xi(x), 1 ≤ i ≤ m, are identified as anomalies. This is equivalent to a statistical test of hypotheses: H0: x is drawn from p_X; H1: x is drawn from an uninformative distribution other than p_X.
If p_Xi(x′) < ε for some 1 ≤ i ≤ m, we reject the null hypothesis H0 and accept H1, i.e., x′ does not conform to the probabilistic models p_X of normal event patterns, where ε is the normalization constant of the uninformative distribution [6].
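As a minimal sketch of this decision rule (the likelihood values and thresholds below are placeholders, not the paper's actual models):

```python
def reject_h0(likelihoods, epsilons):
    """Reject H0 (the sample is normal) if any model's posterior
    likelihood for the sample falls below its threshold epsilon."""
    return any(p < eps for p, eps in zip(likelihoods, epsilons))

# A sample scoring 0.0005 under the second model (below eps = 0.001)
# is flagged as anomalous.
```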
Abnormal events in the real world are too numerous to enumerate, so we divide them into four fine-grained categories: appearance anomalies, global motion anomalies, local motion anomalies, and long-term abnormal activities. For example, skaters, cyclists, and people moving with the help of a wheelchair are local motion anomalies on a sidewalk, yet they have an appearance similar to that of a normal pedestrian. Probabilistic models using only appearance features [2,10,11] cannot detect these local motion anomalies. Long-term abnormal activities like loitering can be detected by Markov models [2,4]; however, Markov models are not sensitive to the other three kinds of abnormal events. In summary, the above video anomaly detection models suffer from limitations such as high missed-detection and false-detection rates. Motivated by this observation, we characterize all four kinds of abnormal events using both appearance and motion features to ensure detection accuracy. Specifically, we employ multiple probabilistic models to learn appearance and motion features in surveillance videos respectively, and then integrate them into an anomaly inference algorithm to infer as many kinds of abnormal events as possible. The main contributions of this paper are as follows: (1) We theoretically formulate video anomaly detection based on multiple probabilistic models inference as a statistical hypothesis testing problem. (2) We propose a novel video anomaly detection algorithm based on multiple probabilistic models inference called MPI-VAD. (3) To strike a trade-off between detection accuracy and computational complexity, we employ a variable-sized cell structure to help extract appearance and motion features from a limited number of video volumes.
The rest of the paper is organized as follows. Section 2 introduces related work on two types of abnormal event detection methods, Accuracy First Methods and Speed First Methods. Section 3 presents our proposed MPI-VAD in detail. Section 4 describes the experiment settings and the results of evaluating MPI-VAD on three publicly available datasets: UMN, CUHK Avenue, and UCSD Pedestrian. Finally, Section 5 concludes our work and discusses possible improvements in future work.

Related Work
Over the past decade, despite important advances in improving video anomaly detection accuracy, there is a lack of methods designed for real-time processing, which impairs applicability in practical scenarios. Real-time video anomaly detection means that the processing time of a frame is shorter than the interval at which new frames arrive. Taking a 30 FPS video sequence as an example, video anomaly detection attains real-time performance when a frame is processed in less than 33.3 milliseconds. According to the trade-off between detection accuracy and computational complexity, existing video anomaly detection methods can be divided into two main categories: Accuracy First Methods, which focus on improving detection accuracy regardless of the required frame processing times, and Speed First Methods, which are primarily concerned with reducing frame processing times to satisfy the real-time requirements of practical applications.
Accuracy First Methods: These methods usually achieve higher detection accuracy at the expense of increased computational complexity and frame processing times. An important characteristic of these methods is that they select a large number of video volumes to be processed, e.g., via dense scanning [2], multi-scale scanning [12], and cell-based methods [13]. Roshtkhari et al. [2] generate millions of features with an overlapped multi-scale scanning technique to enhance detection precision. Bertini et al. [12] compute a descriptor based on three-dimensional gradients from overlapped multi-scale video volumes. Zhu et al. [3] adopt histograms of optical flow (HOF) to detect anomalies in crowded scenes. Cong et al. [14] adopt multi-scale HOF (MHOF), which preserves temporal contextual information and is a highly descriptive feature aimed specifically at accuracy improvement. Although these local feature descriptors extracted from video volumes have shown promising performance, computing them takes long processing times. Leyva et al. [13] employ a variable-sized cell structure-based method to extract features from a limited number of video volumes.
Speed First Methods: Though the above methods attain high detection accuracy, their frame processing times are extremely long, so essential efforts must be made to reduce computational complexity. Lu et al. [15] and Biswas et al. [16] manage to handle only a few features even though they employ multi-scale scanning techniques. Lu et al. [15] employ multi-scale temporal gradients as the prime feature to speed up feature extraction. Biswas et al. [16] adopt the compressed motion vectors of the video sequence itself in a histogram-binning scheme as features. Adam et al. [17] analyze the optical flow of individual regions in the scene to meet real-time processing requirements; unfortunately, they only detect appearance anomalies and cannot detect local motion anomalies or long-term abnormal activities in surveillance videos. A common characteristic of these methods is that their features are fast to extract but not highly descriptive; they usually reduce frame processing times by employing low-complexity descriptors. In short, these previously proposed methods mostly reduce computational complexity at the expense of slightly lower detection accuracy. Our proposed method achieves a trade-off between detection accuracy and computational complexity.

Method
In this section, we first employ a variable-sized cell structure to extract appearance and motion features from a limited number of video volumes. Second, multiple probabilistic models are built on a compact feature set in the model training stage. Finally, we integrate the multiple probabilistic models into MPI-VAD in order to detect four fine-grained categories of abnormal events.

Feature Extraction
In order to select the video volumes for analysis, we first construct a variable-sized cell structure for the whole scene (shown in Fig. 1). Local feature descriptors based on foreground occupancy and optical flow information are extracted from a limited number of video volumes (shown in Fig. 1). Each three-dimensional video volume u has dimensions m_x × m_y × m_t, where m_x and m_y respectively correspond to the horizontal and vertical dimensions of the cell, and m_t denotes the number of consecutive frames.
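One way such a variable-sized cell grid could be generated from a growth rate α and an initial vertical dimension y0 (parameters named later in the experiment settings) is sketched below; the geometric growth scheme is our assumption, since the paper only names the two parameters:

```python
def cell_rows(frame_height, y0, alpha):
    """Return (top, bottom) row spans of vertically growing cells:
    the first row is y0 pixels tall and each subsequent row grows
    by a factor of alpha, clipped to the frame height."""
    rows, y, size = [], 0, float(y0)
    while y < frame_height:
        h = min(round(size), frame_height - y)
        rows.append((y, y + h))
        y += h
        size *= alpha
    return rows
```

With α = 1 this degenerates to a uniform grid; α > 1 makes cells near the bottom of the frame larger, matching the larger apparent size of nearby objects.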
The foreground feature can efficiently describe abnormal object presence such as trucks and wheelchairs. For each video volume u associated with the cell at position (i, j), the corresponding foreground occupancy F(i, j) ∈ R is computed as

F(i, j) = (1/N) Σ_{n=1}^{N} u(n),

where N is the total number of pixels in the video volume u, and u(n) ∈ {0, 1} indicates whether the n-th pixel belongs to the foreground region. If the foreground occupancy F(i, j) of a video volume u exceeds a threshold θ, the volume is considered active, and only active video volumes are further analyzed.
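The occupancy computation and the active-volume test can be sketched directly from the definition above (the binary foreground mask is assumed to come from MOG background subtraction, as described in the experiment settings):

```python
import numpy as np

def foreground_occupancy(u):
    """F(i, j) = (1/N) * sum_n u(n): the fraction of foreground
    pixels in the binary video volume u."""
    return float(np.mean(u))

def is_active(u, theta):
    """Only volumes whose occupancy exceeds theta are analyzed further."""
    return foreground_occupancy(u) > theta
```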
Optical flow information can properly describe motion anomalies such as crowd panic, fights, and other sudden variations. To filter salient regions in active video volumes, we detect STIPs on the absolute temporal frame differences via the FAST detector (shown in Fig. 1). An optical flow energy O_p(x_p, y_p, t_p) and an MHOF descriptor w_p(x_p, y_p, t_p) are generated from each spatio-temporal support region centered at the STIP (x_p, y_p, t_p). The optical flow energy is computed as

O_p(x_p, y_p, t_p) = (1/N) Σ_{n=1}^{N} [(v_x^(n))^2 + (v_y^(n))^2],

where N is the total number of pixels in the spatio-temporal support region, and v_x^(n) and v_y^(n) respectively correspond to the horizontal and vertical components of the n-th pixel's optical flow. The MHOF descriptor w_p(x_p, y_p, t_p) is an 8-bin optical flow histogram with two magnitude layers of bins over the direction range {0, π/2, π, 3π/2}.
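The two motion descriptors can be sketched as follows; the magnitude-layer threshold and the magnitude weighting are assumed details, since the paper does not give the exact MHOF binning equations (the 8 bins are 4 directions × 2 magnitude layers):

```python
import numpy as np

def flow_energy(vx, vy):
    """Mean squared optical-flow magnitude over the support region."""
    return float(np.mean(vx ** 2 + vy ** 2))

def mhof(vx, vy, mag_thresh=1.0):
    """8-bin multi-scale HOF: 4 direction bins centered on
    {0, pi/2, pi, 3pi/2}, duplicated for low/high magnitude layers."""
    mag = np.hypot(vx, vy)
    ang = np.mod(np.arctan2(vy, vx), 2 * np.pi)
    dir_bin = np.floor((ang + np.pi / 4) / (np.pi / 2)).astype(int) % 4
    layer = (mag >= mag_thresh).astype(int)   # 0 = low, 1 = high magnitude
    hist = np.zeros(8)
    for b, l, m in zip(dir_bin.ravel(), layer.ravel(), mag.ravel()):
        hist[4 * l + b] += m                  # magnitude-weighted vote
    return hist / (hist.sum() + 1e-12)        # L1-normalized descriptor
```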

Multiple Probabilistic Models
Our method for building multiple probabilistic models of normal event patterns in the model training stage is illustrated in Fig. 2. Multiple probabilistic models are built on a compact feature set based on foreground occupancy and optical flow information.
Multiple probabilistic models are applied to detect various abnormal events in complex scenes, including appearance anomalies, global motion anomalies, local motion anomalies, and long-term abnormal activities. Foreground occupancy and optical flow energy are analyzed with distinct Gaussian Mixture Models (GMMs), while the MHOF descriptors are analyzed with dictionary models and Markov models.
(1) GMM for Foreground Occupancy: GMMs are widely used in many fields, e.g., iris segmentation [18]. To detect appearance anomalies like variable-sized objects, we use GMMs to learn the foreground occupancy of normal video samples. The foreground occupancy of each cell is analyzed by a GMM with parameters θ_F = {π_k^F, μ_k^F, σ_k^F}, respectively representing the weight, mean, and standard deviation of the k-th component of the GMM:

p(F(i, j) | θ_F) = Σ_k π_k^F N(F(i, j); μ_k^F, σ_k^F),

where N is a normal distribution. The Expectation-Maximization (EM) algorithm is used to train these local GMMs. The parameters of the models are determined exhaustively as

θ_F^MLE = argmax_{θ_F} p(F | θ_F),

where F represents all the foreground occupancy values to be processed, whose posterior likelihood is maximized by iterating with the Akaike Information Criterion (AIC); θ_F^MLE is the corresponding parameter set that results in the maximum likelihood estimate.
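A minimal 1-D EM sketch of such a local GMM is given below; the deterministic quantile initialization is our choice, and the AIC-based selection of the number of components is omitted (the paper caps EM at 10 iterations, which we mirror):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=10):
    """Fit a 1-D Gaussian mixture with plain EM."""
    w = np.full(k, 1.0 / k)                        # component weights
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))  # deterministic init
    sig = np.full(k, x.std() + 1e-6)
    for _ in range(iters):
        # E-step: responsibilities r[n, k]
        d = np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
        r = w * d
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, standard deviations
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return w, mu, sig

def gmm_likelihood(v, w, mu, sig):
    """p(v | theta) = sum_k w_k N(v; mu_k, sigma_k)."""
    return float(np.sum(w * np.exp(-0.5 * ((v - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))))
```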
Considering the spatially immediate neighborhood of each local cell, we construct a final probability density function to calculate the posterior likelihood of F(i, j) from the current cell, in which γ is an exception-modified Kronecker delta function that weights the contributions of the neighboring cells.
(2) GMM for Optical Flow Energy: In contrast to the multiple local GMMs for foreground occupancy, to detect global motion anomalies we employ a single global GMM with parameters θ_O = {π_k^O, μ_k^O, σ_k^O}, respectively representing the weight, mean, and standard deviation of the k-th component:

p(O_p | θ_O) = Σ_k π_k^O N(O_p; μ_k^O, σ_k^O),

where N is a normal distribution, O_p represents all the optical flow energy to be processed, and θ_O^MLE is the corresponding maximum likelihood parameter set.
(3) Dictionary Models for MHOF Descriptors: We are interested in capturing local motion anomalies, considering the fact that activities may vary within the scene. For example, when both a sidewalk and a road exist in a scene, the activities on the sidewalk may differ greatly from those on the road. Hence, we create an individual dictionary for each cell in the scene instead of a global dictionary as proposed in [2,19,20]. Each cell is assigned a dictionary generated from the set S of MHOF descriptors within the cell. We first use k-means to define the cluster centroids z_i ∈ R^8 of the dictionary. The generated dictionary is associated with a normal distribution with parameters θ_DIC = {μ_DIC, σ_DIC}, respectively representing the mean and standard deviation of the distribution, evaluated over the distances d_p = ‖w_p − z_i‖_2, i.e., the l2 distance of a word w_p ∈ S to its nearest cluster centroid z_i. When we calculate the posterior likelihood of observed words, w_p ∈ S implies d_p ≈ 0 and p_DIC(d_p | θ_DIC) → 1; otherwise w_p ∉ S, d_p ≫ 0 and p_DIC(d_p | θ_DIC) → 0. Maximum likelihood estimation is used to train the dictionary models.
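A per-cell dictionary model can be sketched as below. Plain Lloyd iterations with a deterministic initialization stand in for the unspecified k-means variant, and the normalized Gaussian over distances (peaking at 1 for training-like words) is an assumed form of p_DIC:

```python
import numpy as np

def build_dictionary(S, k=2, iters=20):
    """Per-cell dictionary: k-means centroids over the MHOF descriptors
    in S, plus the mean/std of word-to-nearest-centroid distances."""
    # Deterministic init: evenly spaced samples (an assumption)
    Z = S[np.linspace(0, len(S) - 1, k).astype(int)].astype(float).copy()
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(S[:, None] - Z, axis=2), axis=1)
        for i in range(k):
            if np.any(labels == i):
                Z[i] = S[labels == i].mean(axis=0)
    d = np.min(np.linalg.norm(S[:, None] - Z, axis=2), axis=1)
    return Z, float(d.mean()), float(d.std() + 1e-6)

def dict_likelihood(w, Z, mu, sigma):
    """p_DIC -> 1 when w matches a centroid (d near mu ~ 0),
    -> 0 when d >> 0 (normalized Gaussian; assumed form)."""
    d = np.min(np.linalg.norm(Z - w, axis=1))
    return float(np.exp(-0.5 * ((d - mu) / sigma) ** 2))
```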
(4) Markov Models for MHOF Descriptors: A Finite-State Markov Chain (FSMC) is used to capture long-term abnormal activities like loitering. Because the activities in a scene vary significantly across regions, we use multiple local Markov models for different regions to detect anomalous events, instead of creating a global Markov model as in [4]. Consider the current state X_l given by the matching label l of the local dictionary; the probability density function of the FSMC is defined over L states, where L is the total number of labels in the local dictionary, the matching label index l is determined by the nearest dictionary centroid, and the associated state transition matrix is A. The probability of words i and j both occurring is estimated from the co-occurrence of the two words. The order of occurrence of words i and j does not matter when the number of analyzed frames is limited; thus we make the matrix A symmetric.
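Estimating the symmetric transition matrix from a sequence of per-volume dictionary labels can be sketched as follows (the total-count normalization of A is our assumption):

```python
import numpy as np

def transition_matrix(labels, L):
    """Symmetric co-occurrence-based transition matrix of the FSMC,
    estimated from consecutive dictionary-label pairs."""
    A = np.zeros((L, L))
    for i, j in zip(labels[:-1], labels[1:]):
        A[i, j] += 1
        if i != j:
            A[j, i] += 1  # order of occurrence ignored -> symmetric
    A /= max(A.sum(), 1)
    return A

def chain_likelihood(labels, A):
    """Probability of a label sequence under the transition matrix;
    low values flag long-term activities unseen in training."""
    p = 1.0
    for i, j in zip(labels[:-1], labels[1:]):
        p *= A[i, j]
    return p
```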

Anomaly Inference
After building multiple probabilistic models of normal event patterns, a novel video anomaly detection algorithm based on multiple probabilistic models inference, MPI-VAD (shown in Algorithm 1), is proposed to detect the four fine-grained categories of abnormal events in the detection stage. MPI-VAD integrates the multiple probabilistic models into video anomaly detection and jointly considers the detection results of the different probabilistic models. MPI-VAD works in two cascaded phases: mask generation and multiple-mask joint analysis. In the first phase, mask generation, the mechanism evaluates the posterior likelihoods of the appearance and motion features extracted from video volumes and generates three likelihood binary masks: the foreground occupancy mask Mask_FG, the optical flow energy mask Mask_OFE, and the MHOF descriptor mask Mask_MHOF. The posterior likelihood γ_FG of the foreground occupancy F is calculated from the trained GMMs, and Mask_FG is generated by thresholding γ_FG: a video volume is marked 1 (abnormal) if γ_FG < ε_FG and 0 (normal) otherwise, where ε_FG is a posterior likelihood threshold. Similarly, we calculate the posterior likelihoods γ_OFE and γ_MHOF of the optical flow energy O_p and the MHOF descriptors w_p, and generate the likelihood binary masks Mask_OFE and Mask_MHOF by thresholding them with ε_OFE and ε_MHOF respectively, which determine whether the spatio-temporal support region corresponding to O_p or w_p is abnormal.
In the second phase, multiple-mask joint analysis, the above likelihood binary masks are jointly analyzed to determine whether abnormal events occur in the surveillance video. Specifically, if a video volume is identified as anomalous in any individual likelihood binary mask, the corresponding cell at time t is marked as anomalous (Eq. (20)). To make the anomaly inference mechanism more resilient to noise, we use the two consecutive frames at times {t − 1, t} to determine the abnormality of the frame at time t. The resulting binary mask, denoted Mask~_t, represents the final abnormal regions in frame t.
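The two-phase joint analysis can be sketched with boolean masks; the AND combination over frames {t − 1, t} is our reading of the temporal filtering step:

```python
import numpy as np

def fuse_masks(mask_fg, mask_ofe, mask_mhof):
    """Eq. (20): a cell is anomalous if ANY individual likelihood
    binary mask flags it."""
    return mask_fg | mask_ofe | mask_mhof

def temporal_filter(mask_prev, mask_cur):
    """Suppress one-frame noise: keep only detections that are
    present at both t-1 and t (assumed AND combination)."""
    return mask_prev & mask_cur
```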

Experiment Settings
We have implemented MPI-VAD in MATLAB and tested it on a 3.2 GHz CPU with 16 GB RAM. We have verified the effectiveness of MPI-VAD on three publicly available benchmark datasets, i.e., UMN (http://mha.cs.umn.edu/), CUHK Avenue, and UCSD Pedestrian. Tab. 1 shows the details of the three benchmark datasets. We construct the variable-sized cell structure according to the cell growing rate α and the initial vertical dimension y0. For MOG background subtraction, the background learning rate is set to 0.01 on all datasets. The number of frames for background modeling is set to 200 on CUHK Avenue, UCSD Ped1, and Ped2, but 300 on UMN. For the FAST detector, the number of strongest points is set to 40. When applying the EM algorithm to train the GMMs, we limit the number of iterations k to 10, since we empirically observe that AIC usually provides no additional information when k exceeds 10. These parameters are tailored to the characteristics of video sequences in practical surveillance scenarios. Fig. 3 shows detection samples containing the detected abnormal events, which are marked with red masks. We evaluate the performance of MPI-VAD against several state-of-the-art methods. Experimental results show that MPI-VAD achieves detection accuracy competitive with non-real-time methods and outperforms other real-time methods.

Results Evaluation
Two evaluation criteria are adopted to measure the accuracy of video abnormal detection, i.e., Framelevel criterion and Pixel-level criterion. The two evaluation criteria consider the matching degree between the detection results and the ground truth with different granularities.
(1) Frame-level criterion: Once a frame is detected to contain anomalous pixels, it is identified as an anomalous frame. This criterion focuses on abnormal event detection accuracy in the temporal dimension of videos. However, it does not consider detection accuracy in the spatial dimension; thus normal pixels in an anomalous frame may be misidentified as anomalous. (2) Pixel-level criterion: This criterion focuses on abnormal event detection accuracy in both the temporal and spatial dimensions. If at least 40% of the detected pixels in a frame are true anomalous pixels, the anomalous frame is considered to be successfully detected. The Receiver Operating Characteristic (ROC) curve is drawn to measure detection accuracy. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR):

TPR = TP / (TP + FN), FPR = FP / (FP + TN),

where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives, respectively.

Based on a ROC curve, two values are calculated as quantitative indexes: 1) Area Under Curve (AUC): the area under the ROC curve. 2) Equal Error Rate (EER): the FPR value at which FPR + TPR = 1. Notice that AUC and EER are related performance metrics; specifically, EER → 0 as AUC → 1. We also consider whether a method attains real-time processing performance according to its frame processing time.
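These metrics can be sketched as follows; the threshold sweep and trapezoidal integration are standard choices rather than details from the paper:

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs over a threshold sweep:
    TPR = TP/(TP+FN), FPR = FP/(FP+TN)."""
    scores, labels = np.asarray(scores), np.asarray(labels, bool)
    pts = []
    for t in thresholds:
        pred = scores >= t
        tp, fn = np.sum(pred & labels), np.sum(~pred & labels)
        fp, tn = np.sum(pred & ~labels), np.sum(~pred & ~labels)
        pts.append((fp / max(fp + tn, 1), tp / max(tp + fn, 1)))
    return sorted(pts)

def auc(pts):
    """Area under the (FPR, TPR) curve via the trapezoidal rule."""
    xs, ys = zip(*pts)
    return float(sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
                     for i in range(len(xs) - 1)))

def eer(pts):
    """FPR at the ROC point closest to FPR + TPR = 1."""
    return min(pts, key=lambda p: abs(p[0] + p[1] - 1.0))[0]
```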
For the UMN dataset, we report frame-level ROC curves in Fig. 4 and evaluate the corresponding results in terms of the AUC and EER in Tab. 2. From Fig. 4, we notice that the detection accuracy of MPI-VAD is inferior to the methods proposed by Zhu et al. [3] and Li et al. [6]. From Tab. 2, we find that our method achieves the second shortest frame processing time and real-time performance.
For the CUHK Avenue dataset, Fig. 5 shows frame-level ROC curves, and our method attains the best performance. From Tab. 3, we observe that our method achieves the highest AUC and meets the real-time requirement. The shorter frame processing time attained by [15] is mainly because that method employs neither optical flow estimation nor background subtraction to extract motion features, instead using multi-scale temporal gradients with low computational cost. For the UCSD Ped1 dataset, non-real-time methods [20,23-27] tend to attain higher AUC and lower EER than real-time methods [15]. The method of [23] achieves the highest frame-level AUC, and the method of [24] achieves the lowest pixel-level EER, but their frame processing times are much longer than ours; our method achieves competitive detection accuracy and the best real-time performance. Figs. 8 and 9 show ROC curves for the UCSD Ped2 dataset, and Tab. 5 evaluates the corresponding results in terms of AUC and EER. From Tab. 5, we find that our method outperforms the fastest real-time method [15] and attains detection accuracy competitive with non-real-time methods [25-28].

Conclusion
In this paper, we integrate multiple probabilistic models into video anomaly detection and propose a novel video anomaly detection algorithm called MPI-VAD. Owing to the multiple probabilistic models inference, MPI-VAD is able to detect various abnormal events in complex surveillance scenes. Our method employs a variable-sized cell structure to extract appearance and motion features from a limited number of video volumes, thereby achieving a trade-off between detection accuracy and computational complexity. We evaluate MPI-VAD on three publicly available datasets and attain competitive detection accuracy and real-time frame processing performance. However, MPI-VAD takes quite a long time to train the multiple probabilistic models; our future work will therefore focus on reducing the training time.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.