SOINN-Based Abnormal Trajectory Detection for Efficient Video Condensation

With the evolution of video surveillance systems, the demand for video storage grows rapidly; in addition, security guards and forensic officers spend a great deal of time watching surveillance videos to find abnormal events. As most of the scenes in a surveillance video are redundant and contain no information that needs attention, we propose a video condensation method that summarizes the abnormal events in the video by rearranging the moving trajectories and sorting them by degree of anomaly. Our goal is to improve the condensation rate to save more storage and to increase the accuracy of abnormal detection. As the trajectory feature is the key to both goals, this paper proposes a new method for extracting features of moving-object trajectories and uses the SOINN (Self-Organizing Incremental Neural Network) method to achieve high-accuracy abnormal detection. In the results, our method shrinks the video to 10% of the original storage size and achieves 95% accuracy in abnormal event detection, which shows that the method is useful and applicable to the surveillance industry.


Introduction
Surveillance systems are installed throughout our daily environment for public and private safety. To record all possible anomalies within the view of the cameras, surveillance videos are produced continuously around the clock. With the advancement of imaging systems, both the resolution and the file size have increased. However, state-of-the-art compression algorithms have not reduced file size as fast as it has grown. In addition, users preserve surveillance videos for more than one month, so a huge amount of video has to be stored. Therefore, storage capacity is crucial to surveillance systems, and engineers seek efficient strategies for reducing the file size of surveillance videos.
as the video plays, and thus missing some key events in the video. Computer-aided intelligent surveillance helps to prevent this situation.
To solve the issues above, we propose a method that provides two major functionalities required by the surveillance industry: efficient data storage and intelligent surveillance. We extract the moving objects from the videos along with their trajectories and original images, and rearrange them to build a new video with reduced file size and enhanced observability. The rearrangement relies on video condensation and abnormal detection.

Video Condensation
Video condensation, also called video synopsis [1], is a method that rearranges the sequence of moving objects within a video, which is useful when most parts of the video contain many vacant spaces. Condensation pulls moving objects from different times in the video and makes them appear simultaneously to fill the vacant spaces among video frames. Unlike video compression, video condensation works on content analysis rather than traditional information theory, and it can reduce file size several times more than traditional video compression methods.

Intelligent Video Surveillance
Traditionally, surveillance video observers, often forensic officers, watch the video manually to find clues for solving crimes, which is labor-intensive and inefficient. Therefore, intelligent video surveillance systems [2] have been developed to find anomalies automatically. Usually, abnormal events are clearly different from normal ones, and intelligent systems work by inducing the unique features of abnormal events using statistics and machine learning strategies.

System Description
According to [3], detection methods can basically be partitioned into four categories: 1) knowledge-based, 2) feature invariant, 3) template matching, and 4) appearance-based. Appearance-based methods highlight relevant characteristics of face images by machine learning from a training set. The eigenface-based methods [4,5] are particular techniques in this category. An et al. also completed a tracking system using a simple linear Kalman filter [6]. In addition to the Kalman filter, K. Y. Wang proposed a real-time face tracking system based on particle filtering techniques. Shan et al. [7] proposed a new tracking algorithm named MSEPF (Mean Shift Embedded in a Particle Filter). Elmenzain et al. [8] proposed a system that recognizes alphabets and numbers in real time based on hidden Markov models.
Our proposed system combines video condensation and intelligent video surveillance. It aims at shrinking the video size by segmenting foreground from background and rearranging the contained objects sorted by their degree of anomaly. The system moves the detected abnormal moving objects to the front of the surveillance video and the normal objects to the later parts. The flow chart of the proposed method is shown in Fig. 1.
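The rearrangement by degree of anomaly can be illustrated with a minimal sketch. The `ObjectTube` structure, its field names, and the `schedule_by_anomaly` function are hypothetical names introduced here for illustration; the sketch also places objects one after another, whereas an actual condensation would additionally overlap objects in time to fill vacant space.

```python
from dataclasses import dataclass

@dataclass
class ObjectTube:
    """Hypothetical container for one extracted moving object."""
    object_id: int
    length: int      # number of frames in which the object appears
    anomaly: float   # degree of anomaly; higher means more abnormal

def schedule_by_anomaly(tubes):
    """Return {object_id: start_frame} with the most abnormal objects first.

    Simplifying assumption: objects are placed back to back without
    temporal overlap.
    """
    ordered = sorted(tubes, key=lambda t: t.anomaly, reverse=True)
    start, schedule = 0, {}
    for tube in ordered:
        schedule[tube.object_id] = start
        start += tube.length
    return schedule
```

With this ordering, an observer who watches only the beginning of the rearranged video sees the objects with the highest anomaly scores first.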
In particular, our abnormal detection method is implemented with the SOINN [9] clustering algorithm. SOINN is a semi-supervised incremental machine learning method that supports on-line learning. SOINN also allows partially labelled inputs, which meets the demands of existing intelligent surveillance systems.
Mixture Model) [10] produced background, but GMM has some drawbacks and issues. In this section, we propose improved foreground segmentation and occlusion handling methods to attain more accurate GMM-based trajectory feature extraction.

Proposed Foreground Segmentation Method
In most existing work, the foreground mask is obtained by subtracting the GMM-generated background. However, because some object pixels are similar to the background, the produced foreground contains holes, as shown in Fig. 2.
To solve the problem, we improve the foreground mask of each object by the following steps: first, use a bounding box to frame the object and magnify it; second, apply Gaussian-based contour detection to the magnified bounding box; finally, perform morphological operations, such as opening and closing, to fill the holes inside the contours. The process is illustrated in Fig. 3.
Condensation is a kind of synthesis, so the extracted foregrounds should have smooth edges; otherwise there will be gaps in the synthesized frames. To smooth the edges, the following steps are performed. First, we convert the foreground image F to grayscale using a weighted sum of its RGB channels.
The Gaussian blur filter is applied as a low-pass filter to reduce image noise in the grayscale foreground. The Gaussian mask is obtained by the following formula: $G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$, where σ is the standard deviation of the Gaussian function.
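The grayscale conversion and the Gaussian mask can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure of the paper: the channel weights 0.299, 0.587 and 0.114 are the common ITU-R BT.601 luma weights and are assumed here, the function names are hypothetical, and the kernel is normalized to sum to 1, which makes the constant factor of the Gaussian function unnecessary.

```python
import math

def to_gray(pixel):
    """Weighted sum of the RGB channels (assumed BT.601 weights)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def gaussian_kernel(size, sigma):
    """Build a size-by-size Gaussian mask (size must be odd), normalized to sum to 1."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]
```

Convolving the grayscale foreground with this mask gives the low-pass filtered image; a larger σ spreads the weights further from the center and smooths more strongly.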
Applying the Gaussian blur to the foreground image yields the smoothed foreground.
Subsequently, we detect contours within the bounding boxes by applying an adaptive threshold to remove noise. The threshold T(x, y) is set on a pixel-by-pixel basis by computing a weighted average of the r-by-r region around each pixel location minus a constant, which is empirically set to 5, where r is the mask size and takes the same value as in the Gaussian blur filter.
We apply the adaptive threshold T(x, y) to the blurred grayscale foreground S to obtain the binary contour image B of a moving object.
The detection method is especially useful when there are strong illumination or reflectance gradients.
After removing the noise within the foreground, we remove the regions that belong to the background, as illustrated in Fig. 4, by subtracting the edges of the background image.
Finally, the moving object mask is reconstructed by adding the original foreground mask to the contour produced by the above steps; the result is shown in Fig. 5. The result can be improved further by using morphological operations to fill the holes and remove noise.
After the foreground mask of the objects in each frame is produced, the foreground images can be extracted. For each moving object, the image sequence and its coordinates in the video are extracted and saved as a new data structure, as Fig. 6 shows. In this structure, redundant background pixels are discarded, and the video condensation is accomplished.
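The adaptive thresholding step can be sketched in plain Python as below. Several details are assumptions made for illustration: the local average uses uniform weights rather than a weighted window, the window is clipped at the image border, and a pixel above its local threshold is mapped to 255 (the opposite polarity is equally possible). The function name is hypothetical.

```python
def adaptive_threshold(img, r, c=5):
    """Binarize a grayscale image with a per-pixel threshold T(x, y).

    T(x, y) is the mean of the r-by-r neighborhood minus the constant c
    (uniform weights and border clipping are assumptions of this sketch).
    """
    h, w = len(img), len(img[0])
    half = r // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0, 0
            for yy in range(max(0, y - half), min(h, y + half + 1)):
                for xx in range(max(0, x - half), min(w, x + half + 1)):
                    total += img[yy][xx]
                    count += 1
            t = total / count - c
            out[y][x] = 255 if img[y][x] > t else 0
    return out
```

Because the threshold follows the local mean, a slowly varying illumination gradient shifts the threshold along with the pixel values, so only pixels that differ from their neighborhood are separated.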

Image Processing for De-Condensation
After the foreground mask is obtained, its image sequence can be shifted to arbitrary parts of the surveillance video. The de-condensation process synthesizes the image sequence of a moving object onto the background image of the surveillance video. The synthesized frame $I_t$ of the de-condensed video is composed from $I_o$, the foreground image of a moving object, $I_b$, the background image of the surveillance video, and $M_o$, the foreground mask obtained in the last section. The process is depicted in Fig. 7.
If the foreground images are extracted in a rectangular shape, there will be sharp edges at the boundary after synthesis. To solve this issue, we modify the foreground mask by padding some foreground pixels at the boundary of the mask $M_o$ to form a new mask $M_r$. The processed mask is shown in Fig. 8.
Subsequently, $M_r$ is used to construct a Gaussian pyramid, and Laplacian pyramid image fusion is then used to fuse $I_t$ and $I_b$. The fused image $I_r$ has the sharp edges of $I_t$ smoothed out. An example of eliminating sharp edges, using a motorcycle image, is shown in Fig. 9.
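The mask-based synthesis of a frame can be sketched as follows, assuming a binary mask: where the mask is set, the pixel is taken from the object image, and elsewhere the background pixel is kept. The function name and the `top` and `left` placement parameters are hypothetical, and the pyramid-based edge smoothing is not included in this sketch.

```python
def composite(background, foreground, mask, top, left):
    """Paste a masked object patch onto a copy of the background image.

    background: 2D list of pixel values (the background image)
    foreground: 2D list holding the object patch
    mask:       2D list of 0/1 values with the same shape as the patch
    top, left:  position of the patch inside the frame (assumed parameters)
    """
    frame = [row[:] for row in background]  # leave the background untouched
    for dy, (f_row, m_row) in enumerate(zip(foreground, mask)):
        for dx, (f, m) in enumerate(zip(f_row, m_row)):
            if m:
                frame[top + dy][left + dx] = f
    return frame
```

Because only the object patch, its mask, and its coordinates are needed, the same stored object can be pasted at any time position in the rearranged video.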

Occlusion Handling
If two or more moving objects are too close to each other while moving, the foreground mask of one moving object may include parts of the others, as shown in Fig. 10.
The appearance model [11] is constructed to distinguish different Blobs in occlusion. It consists of an RGB color model $M_{RGB}(x)$, which represents the appearance of each pixel of the moving object, together with an associated probability mask $P_c(x)$, which records the probability of the moving object being observed at the corresponding pixel. Where the mask is zero, the probability of the pixel in $P_c(x, y)$ is initialized to zero. The appearance model is then updated over f, the set of foreground pixels, with α set to 0.95.
The appearance model is updated continuously until its corresponding Blob disappears from the field of view.
To resolve overlaps during occlusion, we find the "disputed pixels" in the appearance models. If a pixel x has a non-zero value $P_{c_i}(x)$ in more than one of the appearance model probability masks, it is called a "disputed pixel." Subsequently, a maximum probability classifier with an RGB Gaussian model is used to determine which model produced it.
After applying the appearance model, each occluded pixel can be assigned to its respective Blob. In Fig. 11, the blue text in the frame is the Blob ID. The two overlapping Blobs are successfully separated.
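One common way to update such an appearance model is an exponential running average, sketched below. This form is an assumption made for illustration and is not confirmed by the text: foreground pixels blend the new observation into the color model and raise the probability mask, while non-foreground pixels only decay the probability. A single channel is shown for brevity (each RGB channel would be treated the same way), and the function name is hypothetical.

```python
ALPHA = 0.95  # value of alpha stated in the text

def update_appearance(model, prob, frame, fg_mask, alpha=ALPHA):
    """Update the appearance model and probability mask in place.

    model:   2D list, per-pixel appearance value (one channel for brevity)
    prob:    2D list, per-pixel observation probability
    frame:   2D list, current pixel values aligned with the model
    fg_mask: 2D list of 0/1, 1 where the pixel belongs to the foreground set f
    """
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if fg_mask[y][x]:
                model[y][x] = alpha * model[y][x] + (1 - alpha) * value
                prob[y][x] = alpha * prob[y][x] + (1 - alpha)
            else:
                prob[y][x] = alpha * prob[y][x]
```

With α close to 1 the model changes slowly, so a brief occlusion does not overwrite the learned appearance of the object.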

Abnormal Detection
After the trajectory information in a surveillance video is extracted, it is used by our proposed abnormal detector. To build the detector, first, the trajectories of normal and abnormal moving objects are collected and labelled; second, the SOINN machine learning model is trained on them. The model is employed to analyze moving objects in real time and find abnormal moving objects in sequences of image frames.

Trajectory Features
In practice, the coordinates of an object are discrete and form a non-continuous zig-zag path. We exploit the Kalman filter [12] based on linear regression to process the original trajectory points and obtain a set of smoothed trajectory points. Each trajectory point is described by a feature vector $(x_t, y_t, dx_t, dy_t)$, where $x_t$ and $y_t$ are the location of a moving object at frame t, and the location differences are calculated as $dx_t = x_t - x_{t-1}$ and $dy_t = y_t - y_{t-1}$ in the t-th frame.
After an object is tracked over n successive frames, a trajectory of length n is extracted. All the trajectories $T_i$ are then linked to form a trajectory T of length m, where $T_i$ is the trajectory of the i-th moving object and m = n × i. T is thus composed of the trajectories of all the objects in the scene.
The next step is to normalize the feature values of a moving object. The positions x and y are normalized to the range 0∼1 using $x' = x / \mathit{width}$ and $y' = y / \mathit{height}$, where width and height are the width and height of a frame, respectively. For the moving orientation θ, we normalize the range 0∼359 to 0∼1 accordingly.
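A minimal sketch of the feature extraction and normalization is given below. It assumes the trajectory points have already been smoothed, computes the orientation θ from the frame-to-frame displacement with `atan2` (an assumption, since the text does not state how θ is obtained), and divides θ by 360 to map it into the unit range. The function name is hypothetical.

```python
import math

def trajectory_features(points, width, height):
    """Turn smoothed (x, y) points into normalized per-frame features.

    Returns one tuple (x_norm, y_norm, dx, dy, theta_norm) for every
    frame t >= 1, where dx and dy are differences to the previous frame.
    """
    feats = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Orientation in degrees within [0, 360); assumed definition.
        theta = math.degrees(math.atan2(dy, dx)) % 360.0
        feats.append((x1 / width, y1 / height, dx, dy, theta / 360.0))
    return feats
```

For example, an object moving from (0, 0) to (10, 0) and then to (10, 10) in a 100 × 50 frame first moves along the x axis and then along the y axis, so the normalized orientation changes from 0 to 0.25.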

Abnormal Detection Using SOINN
We collect the trajectory information of moving objects in the scene, $T = \{(x_{p,1}, y_{p,1}, vx_{p,1}, vy_{p,1}), \cdots, (x_{p,m}, y_{p,m}, vx_{p,m}, vy_{p,m})\}$. The SOINN [9] model is employed to analyze moving objects in the real-time camera frames and find abnormal moving objects.
The trajectory of the i-th moving object is denoted as $T_i = \{(x_1, y_1, dx_1, dy_1), \cdots, (x_n, y_n, dx_n, dy_n)\}$. There are several groups in the model, and the abnormal moving object detection system examines whether $T_i$ belongs to a normal group. If the trajectory $T_i$ does not belong to any normal group, it is detected as abnormal. The anomaly detection steps are as follows:
1. Build a neural network with an empty node set Q.
2. Input a new pattern F into Q as a vector $V_F$, where $F \in T_i$, $T_i$ is the trajectory of length n of the i-th object, and $F = (x_p, y_p, x_v, y_v)$.
3. Calculate the Euclidean distance between $V_F$ and the vector $V_q$ of each node q in Q, and determine the top two winners $s_1$ and $s_2$ as follows: $s_1 = \arg\min_{q \in Q} \|V_F - V_q\|$, $s_2 = \arg\min_{q \in Q \setminus \{s_1\}} \|V_F - V_q\|$.
4. Calculate $s_{s_1}$ and $s_{s_2}$ respectively based on the similarity threshold algorithm of SOINN to find the maximum value $\tau_s$ and the node s.
5. Calculate the Euclidean distance $D_F$ between F and node s.
6. Calculate the sums of $D_F$ and of $\tau_s$ over all $F_i$, i = 1, ⋯, n, respectively.
7. Calculate the degree of abnormality $D_A$ of the object trajectory, where $-1 \le D_A \le 1$.
8. In the real world, the definition of abnormality is a fuzzy concept. In addition, the occurrence of abnormal moving objects is continuous. Hence, we set the threshold $s_d$ from 0.6 to 0.8, which is a tolerance range for detecting abnormal moving objects. Then we count the frequency of abnormality occurrence $f_{AO}$ and set the threshold $\tau_f$ as the number of abnormalities. The abnormal object indicator $O_A$ is determined from $f_{AO}$ and $\tau_f$.
Finally, according to the clustering result from SOINN, we use $O_A$ to indicate whether a moving object is abnormal. When a moving object is detected as abnormal, the system can alert the users that an anomaly has occurred.
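The detection steps can be sketched in simplified form as below. This is a sketch under stated assumptions, not the authors' exact procedure: the network is assumed to be already trained and is given as a list of node vectors plus a set of edges; the similarity threshold follows the usual SOINN rule (largest distance to a neighbor, or smallest distance to any other node when there is no neighbor); and the degree $D_A$ with its threshold $s_d$ is replaced by a simpler per-point test, D_F > τ_s, whose occurrences are counted and compared with τ_f. All function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def two_winners(nodes, v):
    """Indices of the nearest and second-nearest nodes (needs >= 2 nodes)."""
    order = sorted(range(len(nodes)), key=lambda i: dist(nodes[i], v))
    return order[0], order[1]

def similarity_threshold(nodes, edges, i):
    """SOINN-style threshold of node i; edges holds (a, b) pairs with a < b."""
    neighbors = [j for j in range(len(nodes))
                 if j != i and (min(i, j), max(i, j)) in edges]
    if neighbors:
        return max(dist(nodes[i], nodes[j]) for j in neighbors)
    return min(dist(nodes[i], nodes[j]) for j in range(len(nodes)) if j != i)

def is_abnormal(nodes, edges, trajectory, tau_f):
    """Flag a trajectory when at least tau_f of its points fall outside the model."""
    count = 0
    for v in trajectory:
        s1, s2 = two_winners(nodes, v)
        # Keep the larger threshold of the two winners and its node.
        tau_s, s = max((similarity_threshold(nodes, edges, k), k) for k in (s1, s2))
        if dist(v, nodes[s]) > tau_s:  # simplified stand-in for the D_A test
            count += 1
    return count >= tau_f
```

Counting occurrences before raising an alarm reflects the observation in step 8 that abnormal behavior is continuous: a single outlying point does not mark the object as abnormal, but repeated ones do.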

Feedback Learning
The SOINN autonomously detects abnormal trajectories based on the built model. However, without prior knowledge, some obviously abnormal objects need a long learning time before they are detected. Therefore, a feedback mechanism is designed by implementing a user interface that accepts human markups. Suppose that some forbidden areas are already known; we can mark these areas with rectangular markups as in Fig. 12, so that any object trespassing in these regions is labelled as abnormal, and the trajectories of these objects are used to retrain and update the SOINN model. In [13], Ashish mentions that multicue feature fusion ensures that the limitations of the individual cues are suppressed and that the cues complement each other in the unified feature. Sensor fusion in intelligent surveillance systems is our direction for improving the learning method.
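The labelling step of the feedback mechanism can be sketched as follows. The region format (x_min, y_min, x_max, y_max), the dictionary of trajectories keyed by object ID, and the function names are assumptions made for illustration; the retraining of the SOINN model with the labelled trajectories is not shown.

```python
def enters_region(trajectory, region):
    """True if any (x, y) point of the trajectory lies inside the rectangle."""
    x_min, y_min, x_max, y_max = region
    return any(x_min <= x <= x_max and y_min <= y <= y_max
               for x, y in trajectory)

def label_by_regions(trajectories, regions):
    """Map each object ID to True (abnormal) if it trespasses any marked region."""
    return {obj_id: any(enters_region(traj, reg) for reg in regions)
            for obj_id, traj in trajectories.items()}
```

The resulting labels can then be fed back as partially labelled inputs, so that trespassing trajectories are learned as abnormal without waiting for the model to discover them on its own.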

Experimental Results
In this section, experimental results for the two major contributions of this paper are presented: the video condensation rate and the abnormal detection accuracy.

Experimental Setup
In the experiment, three types of video are used to verify the performance of the proposed method. The scenes are a square, a campus, and a freeway. Regarding the orientation of the moving objects, the square scene has the highest degree of freedom, the freeway scene has the lowest, and the campus scene lies between the two. Tab. 1 shows the size and other related information of the three scene videos.

Results on Condensation Rate
To evaluate the performance of surveillance video size reduction, we use the condensation rate, which is defined as the number of times by which the video size is reduced.
Tab. 2 shows the condensation rate for the three scenes, which indicates that our method reduces video size more efficiently than the most commonly used image compression formats. In addition, scenes with a higher degree of freedom in orientation have a better condensation rate.

Results on Abnormal Detection Accuracy
The abnormal detection accuracy is evaluated on each trajectory by counting the correctly classified trajectories. We use the following accuracy equation as a statistical measure: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP is the number of true positive samples, TN is the number of true negative samples, FP is the number of false positive samples, and FN is the number of false negative samples. Accuracy is the most common measure of model performance. Precision places more emphasis on the results that the model predicted to be positive. Recall places more emphasis on the results that are actually positive. In the proposed method, a sample is counted as correct only when the prediction agrees with the ground-truth judgment.
To evaluate the effectiveness of the trajectory feature enhancements, we use the square and campus scenes to compare the accuracy before and after applying them. Tabs. 3 and 4 show the improvement in accuracy after applying the enhancement methods.
In addition, we compare the accuracy of detecting abnormal moving objects using a single feature only. Tabs. 5 and 6 show that both scenes reach high accuracy with our detection method, but the content of the video decides which feature gives the best accuracy.
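The three measures mentioned above can be computed from the four counts as in the following sketch. The function name is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from the confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```

As a purely illustrative example with made-up counts (not results from the experiments), 45 true positives, 50 true negatives, 3 false positives, and 2 false negatives give an accuracy of 95/100 = 0.95, a precision of 45/48, and a recall of 45/47.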

Conclusion
In this paper, we have proposed a new video condensation method that rearranges the moving sequences of a video to reduce its size and improve observability for forensic officers. The video can be reduced to about one tenth of its original size, and the accuracy of abnormal detection is around 95%.
However, without prior knowledge supplied to the SOINN, the feedback mechanism is still required during the learning process of the proposed method. Future work should also avoid the eventual drift of the tracker under illumination variation, rotation, and deformation, to ensure more robustness and accuracy on rainy days or in crowded environments. In the future, we would also develop a knowledge-based mechanism for abnormal detection to efficiently reduce the learning time of the abnormal detection method.