Surveillance Video Key Frame Extraction Based on Center Offset

Abstract: With the explosive growth of surveillance video data, browsing videos quickly and effectively has become an urgent problem. Video key frame extraction has received widespread attention as an effective solution. However, accurately capturing the local motion state changes of moving objects in the video is still challenging in key frame extraction. The center offset of a target can reflect the change of its motion state. Based on this observation, this paper proposes a novel key frame extraction method based on the center offset of moving objects. The proposed method uses the center offset to obtain the global and local motion state information of moving objects and selects the video frames where the center offset curve changes suddenly as key frames. Such processing effectively overcomes the inaccuracy of traditional key frame extraction methods. First, the center point of each frame is extracted. Subsequently, the center point offset of each frame is calculated, and the center offset curve is formed by connecting the center offsets of all frames. Finally, candidate key frames are extracted and optimized to generate the final key frames. The experimental results demonstrate that the proposed method outperforms the compared methods in capturing the local motion state changes of moving objects.

Changes in the local motion state of moving objects attract particular attention, especially in surveillance videos. Local motion can be accurately reflected by the center offset of moving objects. We therefore define the frames with the maximum local center offset of moving objects as key frames and propose a corresponding key frame extraction method. To the best of our knowledge, no published work has considered this issue, so it is worthwhile to investigate.
The remainder of this paper is arranged as follows. Section 2 briefly reviews several previous motion-related key frame extraction methods and moving object detection methods. Section 3 explains the concept of object center offset and describes the framework of the proposed method. Experimental results of the proposed method and the compared methods on various video sequences are presented in Section 4. Finally, conclusions are provided in Section 5.

Motion-Related Key Frame Extraction Method
In this section, we review the traditional motion-related key frame extraction methods. As an effective method to solve the problem of large video data browsing, key frame extraction has been widely used in surveillance video applications. A comprehensive and detailed investigation of the existing key frame extraction methods has been made in [15,16].
There are numerous key frame extraction methods based on motion analysis. Wolf [17] first calculated the optical flow for each frame to set a motion metric and then analyzed the metric as a function of time to select the key frames. This method can select key frames appropriate to the composition of the video shot. However, considerable computation is required to calculate the optical flow. Liu et al. [18] put forward the hypothesis that motion is a more salient feature in presenting actions or events in videos. Based on this hypothesis, a triangle model based on perceived motion energy (PME) represents the motion activities in video shots. Liu et al. [19] addressed key frame extraction from the viewpoint of shot reconstruction degree (SRD) and proposed an inflexions-based algorithm. This algorithm first calculates each frame's motion energy to form a curve and then uses polygon simplification to search for the inflexions of the energy curve; finally, the frames at the inflexions of the energy curve are extracted as key frames. It performs well in terms of fidelity and SRD; however, the inflexions of the energy curve are not the same as the inflexions of the video sequence. Ma et al. [20] proposed a new key frame extraction method based on motion acceleration. This method uses the motion acceleration of the primary moving object to obtain motion state changes such as start, stop, acceleration, deceleration, or direction change. The key frames extracted by this method can describe the changes of the motion state. Li et al. [21] presented a motion-focusing method to extract key frames, which focuses on one constant-speed motion and aligns the video frames by fixing this focused motion into a static situation. According to relative motion theory, the other video objects are moving relative to the selected motion. Zhong et al. [22] proposed a fully automatic and computationally efficient framework for analysis and summarization of surveillance videos. This framework uses the motion trajectory to represent the moving process of the target. Zhang et al. [23] presented a method for key frame extraction based on spatio-temporal motion trajectory, which can obtain the state changes of all moving objects. This method defines frames at inflexions of the motion trajectory on the spatio-temporal slice (MTSS) as key frames, because the inflexions of the MTSS can capture all motion state changes of moving objects.
Each of the above methods performs well in its intended setting; however, they tend to ignore changes in the local motion state of the moving objects. The center offset of moving objects can be employed to describe the changes of the local motion state. Based on this observation, this paper proposes a key frame extraction method based on center offset.

Moving Object Detection
As one of the most fundamental and challenging problems underlying object extraction, object classification [24-26], object tracking [27], crowd counting [28], and object recognition [29], object detection has attracted considerable attention in recent years. Many papers on moving object detection have been published. A survey of various methods used for moving object detection in video surveillance applications is given in [30].
As a hot topic in video processing, moving object detection plays a vital role in the subsequent processing of object classification, tracking, and behavior understanding in videos. However, due to the complexity of video scenarios, many problems in moving object detection remain to be solved. Currently, the background subtraction method and the frame difference method are two common methods for moving object detection in surveillance videos. The basic steps of the background subtraction method [31] are as follows: first, a background model is established; then, the input image is compared with it; finally, moving objects are detected from changes in statistical information such as gray level or histogram.
The conventional inter-frame difference method subtracts two consecutive frames to obtain moving objects. If a pixel differs greatly between the two frames, the difference is usually caused by a moving object. If these pixels are marked, the moving objects in the video frames can be obtained. This method is simple and computationally inexpensive, but the obtained moving object may contain "holes". Therefore, some scholars have improved the traditional inter-frame difference method, and the more effective variant is the three-frame difference method.
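The following minimal sketch illustrates the idea on toy data. It assumes grayscale frames stored as NumPy arrays, a hypothetical threshold `thresh`, and the common formulation of three-frame differencing as the logical AND of two consecutive difference masks; none of these details are specified in this paper.

```python
import numpy as np

def frame_difference(prev, curr, thresh=25):
    """Two-frame difference: mark pixels whose gray level changed by more than thresh."""
    # Cast to a signed type so the subtraction cannot wrap around for uint8 input.
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > thresh

def three_frame_difference(prev, curr, nxt, thresh=25):
    """Three-frame difference (assumed AND form): keep pixels that changed in both pairs."""
    return frame_difference(prev, curr, thresh) & frame_difference(curr, nxt, thresh)

# Toy 1x9 "frames": a bright 3-pixel object moving right by two pixels per frame.
f0 = np.array([[0, 200, 200, 200, 0, 0, 0, 0, 0]], dtype=np.uint8)
f1 = np.array([[0, 0, 0, 200, 200, 200, 0, 0, 0]], dtype=np.uint8)
f2 = np.array([[0, 0, 0, 0, 0, 200, 200, 200, 0]], dtype=np.uint8)

two = frame_difference(f0, f1)
three = three_frame_difference(f0, f1, f2)
print(two.astype(int))    # pixel 3 belongs to the object in f1 but is not marked (a "hole")
print(three.astype(int))  # only pixel 4 survives; the trail left behind at pixels 1-2 is removed
```

In this toy case the two-frame mask marks both the old and the new object position and misses the overlapping pixel, while the three-frame mask keeps only a pixel that lies on the object in the middle frame.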
In addition to the above methods, there are optical flow methods [32,33], background modeling methods, and others. Combining the advantages of various moving object detection algorithms [34] can yield good detection results. Based on the analysis of the above methods and the actual scenes of the experimental videos, this paper adopts the background subtraction method with background updating to detect moving objects.
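As a rough illustration of background subtraction with background updating, the sketch below uses a running-average background that is updated only at pixels classified as background. The update rule, the threshold `thresh`, and the learning rate `alpha` are assumptions made for this example; the paper does not state which background model it uses.

```python
import numpy as np

def detect_and_update(background, frame, thresh=30.0, alpha=0.05):
    """Return (foreground mask, updated background) for one grayscale frame."""
    frame = frame.astype(np.float64)
    # Foreground: pixels that deviate strongly from the background model.
    mask = np.abs(frame - background) > thresh
    # Assumed update rule: blend the new frame into the model at background pixels only,
    # so that moving objects are not absorbed into the background.
    blended = (1.0 - alpha) * background + alpha * frame
    updated = np.where(mask, background, blended)
    return mask, updated

# Toy scene: uniform background at gray level 50, a slightly brighter new frame
# (gradual illumination change) containing a 2x2 moving object at gray level 200.
background = np.full((4, 6), 50.0)
frame = np.full((4, 6), 60, dtype=np.uint8)
frame[1:3, 2:4] = 200

mask, background = detect_and_update(background, frame)
print(mask.astype(int))
print(background[0, 0], background[1, 2])
```

The small illumination change is absorbed into the background model, while the object pixels are flagged and left out of the update.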

Center Offset
The center point of a moving object is defined as the center of the object's shape, which is called the centroid in mathematics. Mathematically, the centroid is the geometric center of a cross-section figure and is a purely geometric property. For objects with uniform density, the center of mass coincides with the centroid. In the process of motion, the moving object in the video may exhibit different actions. We regard the moving object as an abstract geometry with uniform density that changes its shape constantly. Therefore, a different cross-section shape is left in each video frame during the moving process. The cross-section shape generated in each frame is called the object motion shape, as shown in Fig. 1. Fig. 1 shows the shapes formed by the target when it is standing erect, reaching, squatting, etc. From Fig. 1, we find that when a moving object makes a local motion, its motion shape changes, that is, the position of the object center is offset. That is why the center offset is employed to reflect the changes of the local motion state.
Next, how do we calculate the center coordinates of the object motion shape? In the Cartesian coordinate system, if the coordinates of the vertices of a triangle are $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$, respectively, the coordinates of the center point $(x, y)$ can be calculated as:
$$x = \frac{x_1 + x_2 + x_3}{3}, \qquad y = \frac{y_1 + y_2 + y_3}{3} \tag{1}$$
If the vertex coordinates of a rectangle are $(x_1, y_1)$, $(x_1, y_2)$, $(x_2, y_1)$, and $(x_2, y_2)$, respectively, then the coordinates of the center point $(x, y)$ can be obtained by:
$$x = \frac{x_1 + x_2}{2}, \qquad y = \frac{y_1 + y_2}{2} \tag{2}$$
When the figure is a general polygon, a double integral is needed to calculate the centroid. To simplify the calculation, the center point of the circumscribed rectangle of the moving object is selected to represent the moving object. An example of the center point is shown in Fig. 2. From Fig. 2, it can be seen that the position of the rectangle center point changes with the positions of the rectangle's four vertices. This indicates that when the moving object makes a local motion such as bending over or stretching up, the circumscribed rectangle changes, and the position of its center point changes accordingly. That is, the object center offset can reflect the changes of both the global and the local motion state of the moving object. Therefore, we select the center offset of the circumscribed rectangle as the motion descriptor and use it to describe the changes of all motion states.
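A short sketch can make the effect of local motion on the rectangle center concrete. It assumes the moving object is available as a binary foreground mask; the function name `bounding_rect_center` and the toy "standing" and "reaching" shapes are illustrative only.

```python
import numpy as np

def bounding_rect_center(mask):
    """Center (x, y) of the axis-aligned circumscribed rectangle of a binary mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no foreground pixels")
    x1, x2 = xs.min(), xs.max()
    y1, y2 = ys.min(), ys.max()
    # Midpoint of the rectangle spanned by (x1, y1) and (x2, y2).
    return float(x1 + x2) / 2.0, float(y1 + y2) / 2.0

# Toy "standing" figure: a vertical bar two pixels wide.
standing = np.zeros((10, 10), dtype=bool)
standing[2:8, 4:6] = True

# Same figure "reaching": an arm extends three pixels to the right.
reaching = standing.copy()
reaching[3, 6:9] = True

print(bounding_rect_center(standing))  # (4.5, 4.5)
print(bounding_rect_center(reaching))  # (6.0, 4.5)
```

Although the body of the figure has not moved, the extended arm widens the circumscribed rectangle and shifts its center by 1.5 pixels horizontally, which is the kind of local change the center offset is meant to capture.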
When there is only one moving object in the video frame, the center point of the object rectangle is taken as the center point of the video frame. However, when there are multiple moving objects in the video frame, the center point of the video frame is the average of the center points of all objects, as shown in Fig. 3. From Fig. 3, it can be seen that when multiple objects are moving at the same time, the average of the rectangle center points of all objects is used as the center point of the video frame. The reason is that when one object moves, its center point changes and the coordinates of the frame center point change too, so the center offset of the video frame can reflect the changes of each object's motion state. Therefore, it is feasible to use the center offset of moving objects in adjacent video frames to reflect the changes of the motion state of objects. Based on this observation, a video key frame extraction method based on the center offset of moving targets is proposed.
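The averaging rule for multiple objects can be sketched as follows; `frame_center` is a hypothetical helper name, and the coordinates are made-up values for illustration.

```python
def frame_center(object_centers):
    """Frame center point as the mean of the per-object rectangle centers."""
    if not object_centers:
        raise ValueError("no moving objects in this frame")
    n = len(object_centers)
    cx = sum(x for x, _ in object_centers) / n
    cy = sum(y for _, y in object_centers) / n
    return cx, cy

# Two objects; between the frames only the second object moves 4 pixels to the right.
before = frame_center([(10, 20), (30, 40)])
after = frame_center([(10, 20), (34, 40)])
print(before)  # (20.0, 30.0)
print(after)   # (22.0, 30.0)
```

The motion of a single object moves the frame center, although with n objects the shift is scaled by 1/n.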
For video V, the center offset of the moving object can be defined as:
$$CO(t) = CO_x(t) + j\,CO_y(t) \tag{3}$$
where $CO(t)$ represents the center point offset of the moving object at time $t$, and $CO_x(t)$ and $CO_y(t)$ are the horizontal component and the vertical component of $CO(t)$, respectively. Let $P(x_1, y_1, t-1)$ and $P(x_2, y_2, t)$ denote the coordinates of the moving object center point at times $t-1$ and $t$, respectively. Then the center offset $CO(t)$ can be expressed as:
$$CO(t) = (x_2 - x_1) + j\,(y_2 - y_1) \tag{4}$$
The vector in Eq. (4) can be computed as:
$$CO(t) = |CO(t)| \exp[-j\theta(t)] \tag{5}$$
where $|CO(t)|$ and $\theta(t)$ represent the magnitude and angle of $CO(t)$, respectively, with:
$$|CO(t)| = \sqrt{CO_x(t)^2 + CO_y(t)^2} \tag{6}$$
Eq. (5) shows that a frame with a sufficiently large $|CO(t)|$ is likely to be extracted as a key frame. However, the result depends not only on $|CO(t)|$; $\exp[-j\theta(t)]$ is also a very important factor. For simplicity, $\exp[-j\theta(t)]$ is defined as:
The center offset of each video frame can be calculated by using the above equations.
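As a minimal sketch of this computation, the code below derives the horizontal and vertical components, the magnitude, and an angle from the center points at times t − 1 and t. The function names are hypothetical, and the angle is computed with `math.atan2` as an assumed convention; the paper's own definition of the angle term is not reproduced here.

```python
import math

def center_offset(p_prev, p_curr):
    """Offset between the center point at time t-1 (p_prev) and at time t (p_curr)."""
    co_x = p_curr[0] - p_prev[0]        # horizontal component
    co_y = p_curr[1] - p_prev[1]        # vertical component
    magnitude = math.hypot(co_x, co_y)  # Euclidean length of the offset
    angle = math.atan2(co_y, co_x)      # assumed angle convention, in radians
    return co_x, co_y, magnitude, angle

def offset_magnitude_curve(centers):
    """Magnitude of the center offset for every pair of consecutive frames."""
    return [center_offset(a, b)[2] for a, b in zip(centers, centers[1:])]

# Frame centers for four consecutive frames (made-up values).
centers = [(0, 0), (3, 4), (3, 4), (6, 8)]
curve = offset_magnitude_curve(centers)
print(curve)  # [5.0, 0.0, 5.0]
```

Connecting these per-frame values over time gives a curve whose peaks mark frames with large changes in the center position.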

Key Frame Extraction Based on Center Offset
This paper defines a frame at which the peak of the center offset curve changes abruptly as a key frame. Accordingly, a novel key frame extraction method based on center offset is proposed. The framework of the proposed method is shown in Fig. 4.
Step 1. Moving object extraction. The background subtraction method is used to detect the moving object in the input surveillance video sequence; the moving object is then extracted and marked with its circumscribed rectangle.
Step 2. Center point extraction. The midpoint of the circumscribed rectangle of the moving object is selected as the object center, which gives the coordinates of the object center.
Step 3. Center offset curve generation. The center offset of the object is calculated from the known center point coordinates, and the center offsets of all frames are connected to form a center offset curve.
Step 4. Peak detection. The peaks of the center offset curve formed in Step 3 are detected, and the video frames corresponding to the peak values of the curve are extracted as candidate key frames.
Step 5. Key frame extraction. To reduce the redundancy of key frames, a candidate frame is extracted only if its peak value is N times that of the previous key frame. Finally, the extracted video frames consist of the video frames at the peak mutations, the first frame, and the last frame of the input surveillance video.
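Steps 4 and 5 can be sketched on a toy offset curve as follows. This is one possible reading rather than the exact procedure: peaks are taken as simple local maxima, the "N times" rule is interpreted as "at least n times the peak value of the most recently accepted key frame", and `curve[i]` is assumed to hold the offset magnitude of frame i. The names `find_peaks`, `select_key_frames`, and the default `n=1.5` are hypothetical.

```python
def find_peaks(curve):
    """Indices of local maxima: higher than the left neighbor, not lower than the right."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]]

def select_key_frames(curve, n=1.5):
    """Keep peaks that are at least n times the previously kept peak, plus first/last frame."""
    selected = []
    last_value = None
    for i in find_peaks(curve):
        # Assumed reading of the "N times" rule; the first peak is always accepted.
        if last_value is None or curve[i] >= n * last_value:
            selected.append(i)
            last_value = curve[i]
    # The first and last frames of the video are always included.
    return sorted(set([0, len(curve) - 1] + selected))

curve = [0, 1, 4, 2, 1, 5, 3, 9, 2, 0]
print(find_peaks(curve))         # [2, 5, 7]
print(select_key_frames(curve))  # [0, 2, 7, 9]
```

Here the peak at frame 5 (value 5) is dropped because it is less than 1.5 times the previously accepted peak (value 4), whereas the peak at frame 7 (value 9) is kept.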
Next, the extracted key frames are optimized according to the visual resolution mechanism [35] to determine the final key frames.
In practice, a specified number k of key frames is extracted to ensure objectivity. When the number of extracted key frames K (i.e., the final number of key frames determined in Step 5) is less than the specified number k, the video frames with larger peak values that are not already key frames are inserted first. If the video frames at the other peak points are not enough to supply the k - K missing frames, the remaining frames are made up by the interpolation method [36]. Conversely, when K is greater than k, the K - k frames with the smallest peaks among the key frames determined in Step 5 are removed, so that the specified number k of key frames is extracted.
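The adjustment to a specified number of key frames might look like the sketch below. It covers only the peak-based part: topping up from unused peaks with the largest values, or trimming the smallest-peak key frames. The interpolation step of [36] is not implemented, and the special handling of the first and last frames is left out; `adjust_key_frames` and its arguments are hypothetical names.

```python
def adjust_key_frames(key_frames, candidate_peaks, curve, target):
    """Top up or trim key frame indices toward the target count using peak values."""
    key = list(key_frames)
    if len(key) < target:
        # Insert unused candidate peaks, largest peak value first.
        extras = sorted((i for i in candidate_peaks if i not in key),
                        key=lambda i: curve[i], reverse=True)
        key.extend(extras[:target - len(key)])
        # Any remaining shortfall would be filled by interpolation [36] (not shown).
    elif len(key) > target:
        # Remove the key frames with the smallest peak values.
        key = sorted(key, key=lambda i: curve[i], reverse=True)[:target]
    return sorted(key)

curve = [0, 1, 4, 2, 1, 5, 3, 9, 2, 0]
peaks = [2, 5, 7]   # candidate key frames (all detected peaks)
keys = [2, 7]       # key frames kept after the redundancy check

print(adjust_key_frames(keys, peaks, curve, target=3))  # [2, 5, 7]
print(adjust_key_frames(keys, peaks, curve, target=1))  # [7]
```

With a target of 3, the unused peak at frame 5 is inserted; with a target of 1, the key frame with the smaller peak (frame 2) is removed.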

Experimental Results and Analysis
To evaluate the correctness and effectiveness of the proposed method, we conducted experiments to verify its validity and its superiority over state-of-the-art methods. The experiments were performed on a general-purpose computer with an Intel Core (TM) i5-4200 CPU and 8 GB memory.

Test Data Set
The experiments used 16 test videos of different scenes to ensure the generality of the method. Some of them are from the standard data sets ViSOR [37], CAVIAR [38], and BEHAVE [39], and the others are self-collected surveillance videos. Tab. 1 shows the detailed information of the test videos.

Evaluation Criterion
To demonstrate the correctness and effectiveness of the proposed method, both subjective and objective evaluation criteria are used in the experiments. Subjective criteria mainly include result discussion and user studies, and the widely used objective evaluation criteria are Fidelity [40] and SRD [19]. Compared with the Fidelity criterion, the SRD criterion can evaluate the key frames from the dynamic aspect of capturing local details. A key frame set with high SRD must also have high Fidelity; nevertheless, high Fidelity does not necessarily mean high SRD. Therefore, the result discussion is used to verify the correctness of the proposed method, and comparative analysis and the SRD criterion are used to verify its effectiveness.

Correctness
To demonstrate the correctness of the method, we applied the proposed method to the 16 test videos and achieved desirable results. Specifically, the extracted key frames indicate that frames showing changes in the global and local motion state of objects, in a variety of scenes, can be effectively extracted by the proposed method. Due to space limitations, only the two key frame extraction results in Figs. 5 and 6 are discussed. For video 8, the proposed method discards some video frames with high peaks and extracts video frames with relatively low peaks as key frames. This is the result of the parameter setting and the optimization criteria. In detail, due to the influence of environmental changes, the video frames before No. 58 have higher peak values; therefore, we optimize the experimental results through parameter setting and key frame optimization. The frames after No. 148 have lower peak values, and the parameter settings allow some of them to be extracted as key frames. This ensures that the extracted key frames can describe the whole motion in video 8. In this video, the scenes that attract more attention are the appearance of the two targets, the handshake of the two targets, and the changes in their movement after the handshake. Observing the result of key frame extraction, we find that frames No. 58, No. 117, and No. 136 show the appearance of the two objects (changes of the global motion state of the objects), and frames No. 148, No. 172, and No. 187 show the process of reaching out before the handshake (changes of the local motion state of the objects). Frames No. 206 and No. 229 show the handshake of the two targets (a change of the local motion state of the objects), and frame No. 251 shows the two objects moving in another direction after the handshake (a change of the global motion state of the objects). Fig. 6 shows similar results. The video frames at the first peak and the last peak, which are similar to the first frame and the last frame, are determined according to the optimization criteria. The proposed method can extract the target squatting action (frame No. 6), the change in the target squatting action (frame No. 13), and the target standing-up action (frames No. 21, No. 26, and No. 29). This demonstrates that the proposed method can extract the changes of the local motion state well.

Figure 6: Video key frames extracted from video 16
As discussed above, we extract the video key frames based on the attractive feature of local and global motion state changes, and obtain them by analyzing the offset of the target center. Consequently, they are consistent with human visual perception. The discussion in this subsection validates the correctness of the proposed method.

Effectiveness
To verify the effectiveness of the proposed method, we also compare it with state-of-the-art motion-related methods: the method based on the perceived motion energy model in [18] (denoted as ME), the method based on motion acceleration in [20] (denoted as MA), the method based on spatio-temporal motion trajectory in [23] (denoted as MTSS), and the method based on motion speed in [41] (denoted as MV). These methods are closely related to the proposed method. To ensure the universality and robustness of the proposed method, experiments were conducted on the 16 test videos from the public data sets and the self-collected videos. The performance of the proposed method was compared with that of the other methods using both the subjective criterion and the objective criterion. The details are presented as follows.
The key frame extraction results of the five methods are first evaluated using the subjective criterion. To ensure the objectivity of the experiment, every method extracts 10 frames as key frames. The results of the five key frame extraction methods were obtained on all 16 video segments, and the proposed method is superior to the others. Due to the limited number of pages, only the key frame extraction results of video 8 are displayed. The key frame extraction results of the proposed method and the compared methods on the video 8 sequence are shown in Fig. 7. From Fig. 7, it can be seen that the proposed method extracted the process of the two targets appearing separately and shaking hands. MV can extract the video frames in which the two targets appear and the handshake action of the two targets, but the video frames extracted by this method omit the movement process before the handshake. The key frames extracted by MTSS omit the video frame in which the first moving object appears. The key frames extracted by MA omit the video frame in which the first moving object appears and also fail to include the video frame in which the second moving object appears. The key frames extracted by ME contain much redundancy and many blank frames. To sum up, the proposed method can extract the video frames that contain motion state changes, especially in multi-object surveillance video.
As an objective criterion, SRD is used to evaluate the key frame extraction results of the proposed method and the compared methods. The SRD criterion evaluates a key frame extraction method in terms of its video reconstruction ability. The larger the calculated SRD is, the better the reconstruction, meaning that the video reconstructed from the extracted key frames is closer to the original video. Fig. 8 shows the average SRD obtained by the proposed method and the compared methods on all test videos with different key frame ratios (2%-12%).
From Fig. 8, it can be seen that the average SRD increases with the number of extracted key frames for both the proposed method and the compared methods. When the key frame extraction rate is 2% to 6%, the average SRD of the proposed method is almost the same as that of MV, MTSS, and MA, and it is significantly higher than that of ME. When the key frame extraction rate is between 8% and 12%, the average SRD of the proposed method is about 0.3 dB higher than that of the compared methods. The reason is that the proposed method considers both the local and the global changes of all object motion states, while the compared methods focus only on the global motion state changes. It can be concluded that the proposed method is superior to the compared methods under the SRD criterion and that it captures the local motion state changes of each object better. Therefore, the above discussion demonstrates that the proposed method is effective.

Conclusions
This paper proposed a novel center offset-based method to extract key frames from surveillance video. The center offset is used to capture the global and local motion state changes of moving objects. In other words, each moving target is represented by its center point. When there are multiple objects in the video frame, the method takes the mean of the center points of these moving targets as the center point of the video frame. Next, the center offset of each frame is calculated, and the offsets are connected to form a center offset curve. Finally, the video frames at the peak mutations are extracted as key frames. Experimental results demonstrate that the proposed method outperforms the existing state-of-the-art methods in capturing the local motion state changes of moving objects.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.