Video summarization reduces redundancy and develops a concise representation of the key frames in a video. More recently, video summaries have been generated through visual attention modeling. In these schemes, the frames that stand out visually are extracted as key frames on the basis of theories of human attention modeling. Schemes for modeling visual attention have proven effective for video summarization; nevertheless, their high computational cost restricts their usability in everyday situations. In this context, we propose a key frame extraction (KFE) method built on an efficient and accurate visual attention model. The computational effort is reduced by using dynamic visual saliency based on the temporal gradient instead of traditional optical-flow techniques. In addition, an efficient technique based on the discrete cosine transform is used for the static visual saliency. The dynamic and static visual attention measures are merged by means of a non-linear weighted fusion technique. The results of the system are compared with several existing state-of-the-art techniques in terms of accuracy. The experimental results indicate the efficiency of the proposed model and the high quality of the extracted key frames.
KFE and video skimming are two fundamental techniques for the summarization of videos [
The present paper focuses on the extraction of key frames from videos. Ideally, video summarization schemes should operate at the top level of the semantic video content, i.e., on objects, events, and activities. As a rule, however, the extraction of such semantic primitives is not achievable, although some domain-specific procedures have been proposed. To close the semantic gap, several researchers [
This paper proposes an efficient KFE scheme that relies on visual attention. The system computes static as well as dynamic visual saliency maps and then merges them in a non-linear manner to extract the key frames. The static saliency model uses the image signature as the basis for salience detection [
The remainder of the paper is structured as follows. Section II introduces the related work, Section III outlines the proposed mechanism, Section IV presents the experimental results, and Section V concludes the paper.
There are several domain-specific procedures for key-frame extraction that use high-level semantic features of the videos. For example, Chen et al. summarized basketball clips on the basis of automatic scenario analysis and determination of the camera perspective. Calic et al. [
Ma et al. [
The bottom-up system is activated in response to low-level characteristics (texture, color, motion) that differ visually from the rest of the scene. The bottom-up attention mechanism is reflexive, task-independent, transient, and fast. The proposed framework is given in
A spatial attention model is designed by computing visual salience on the basis of an image descriptor known as the “image signature”. The image signature can be used to estimate the foreground of an image [
A given video frame “F” is first reduced to a size of 63 × 49. Next, the image signature “
Here, sign(.) denotes the signum operator, DCT is the discrete cosine transform, and “Fc” denotes color channel “c” of frame “F”. The image signature is transformed back to the spatial domain by an inverse DCT to obtain the reconstructed image “
The static salience map of “
“S” is the standard deviation of the distribution, whose value is assumed to be 0.045. The saliency map of every color channel is summed up linearly to obtain the total static saliency map “
The CIELAB color space is used for the selection of color channels because of its ability to approximate human visual perception. The saliency map “S(F)” is then normalized to the range 0 to 1 by dividing every value by the maximum value in the map. The static attention value “as” of a frame is obtained as the mean of the values in the saliency map “S(F)”. When “as” is close to one, the frame is regarded as salient; conversely, a value of “as” close to zero indicates an unremarkable frame.
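To make the static attention computation concrete, the following is a minimal sketch of the procedure described above, assuming OpenCV and SciPy are available. The function name, the interpretation of the 0.045 standard deviation as being relative to the map width, and the use of the mean saliency as the static attention value are illustrative choices, not the paper's exact implementation.

```python
import cv2
import numpy as np
from scipy.fft import dctn, idctn

def static_saliency(frame_bgr, size=(63, 49), sigma=0.045):
    """Static saliency from the image signature (sign of the DCT),
    computed per CIELAB channel and fused by linear summation."""
    # Work in the CIELAB space, which approximates human color perception.
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    small = cv2.resize(lab, size)                  # reduce the frame to 63 x 49

    saliency = np.zeros(small.shape[:2], dtype=np.float32)
    for c in range(small.shape[2]):
        signature = np.sign(dctn(small[:, :, c], norm='ortho'))  # image signature
        recon = idctn(signature, norm='ortho')                   # back to the spatial domain
        # Square the reconstruction and smooth it with a Gaussian kernel
        # (sigma taken relative to the map width -- an assumption).
        chan_sal = cv2.GaussianBlur(recon ** 2, (0, 0), sigmaX=sigma * size[0])
        saliency += chan_sal                                      # linear channel fusion

    saliency /= saliency.max() + 1e-8              # normalize to [0, 1]
    return saliency, float(saliency.mean())        # map and static attention value a_s
```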
In videos, people tend to pay more attention to the movement of objects relative to one another. To obtain the motion information in video streams quickly, the idea of temporal gradients is used. In this manner, the motion information is implicitly computed by measuring the temporal variation of pixel values in adjacent frames. This property makes temporal gradients suitable for use in online systems.
There are two frames of the video
After calculating the gradient vector for every pixel in frame
By calculating the salience value at every pixel, the temporal salience map
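The sketch below illustrates this idea under simple assumptions: the temporal gradient is approximated by the absolute difference of grayscale pixel values in two adjacent frames, and the dynamic attention value of a frame is taken as the mean of the resulting map. Function and variable names are illustrative.

```python
import cv2
import numpy as np

def dynamic_saliency(prev_frame_bgr, curr_frame_bgr):
    """Approximate motion saliency from the temporal gradient, i.e. the
    pixel-wise intensity change between two adjacent frames."""
    prev_gray = cv2.cvtColor(prev_frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    curr_gray = cv2.cvtColor(curr_frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # Temporal gradient: how much each pixel value changed between the frames.
    temporal_grad = np.abs(curr_gray - prev_gray)

    # Normalize to [0, 1] to obtain the temporal saliency map.
    sal_map = temporal_grad / (temporal_grad.max() + 1e-8)

    # Dynamic attention value of the frame (here: the mean saliency).
    return sal_map, float(sal_map.mean())
```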
In most cases, researchers have used linear fusion schemes to combine multiple attention values into an overall attention value. Given “n” attention values to be combined, the general form of a linear fusion scheme is as follows:
In the literature on visual attention-based video summarization, authors have built key-frame-based video summaries using visual attention cues and linear fusion schemes, assigning a greater weight to the motion attention score than to the static one.
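A minimal sketch of such a linear fusion is shown below; the weights, which favor the motion score, are illustrative values rather than the ones used in the cited works.

```python
def linear_fusion(attention_values, weights):
    """General linear fusion of n attention values: a = sum_i w_i * a_i."""
    assert len(attention_values) == len(weights)
    return sum(w * a for w, a in zip(weights, attention_values))

# Example: static and dynamic attention values, with a larger weight
# given to the motion score (the values and weights are illustrative).
a_static, a_dynamic = 0.42, 0.65
a_fused = linear_fusion([a_static, a_dynamic], [0.3, 0.7])
print(a_fused)
```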
Within frame “F”, denoted by “
If the motion contrast in
In case “
“W” is determined as
The max fusion model selects the larger of the two attention values to be merged. It satisfies the property of inequality (14); however, the following basic property of an attention function is violated by max fusion:
The fusion scheme of
The fused attention score of every frame is used to create an attention curve representing the video, which is then used for KFE. When the key frame number “
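The following sketch illustrates this pipeline end to end. The non-linear weight, which grows with the dynamic (motion) score, is an assumed form standing in for the paper's exact weighting equations, and the peak-picking strategy over the attention curve is likewise a simplified stand-in.

```python
import numpy as np

def nonlinear_fusion(a_static, a_dynamic):
    """Illustrative non-linear weighted fusion: the weight of the dynamic
    score grows with the motion score itself (assumed form, not the
    paper's exact equation)."""
    w = a_dynamic / (a_static + a_dynamic + 1e-8)
    return w * a_dynamic + (1.0 - w) * a_static

def extract_key_frames(attention_curve, num_key_frames):
    """Pick key frames from the attention curve: local maxima first,
    ranked by their attention score."""
    a = np.asarray(attention_curve, dtype=np.float32)
    # Indices that are local peaks of the attention curve.
    peaks = [i for i in range(1, len(a) - 1)
             if a[i] >= a[i - 1] and a[i] >= a[i + 1]]
    peaks.sort(key=lambda i: a[i], reverse=True)
    return sorted(peaks[:num_key_frames])

# Toy example: fuse per-frame (static, dynamic) scores, then pick 2 key frames.
scores = [(0.2, 0.1), (0.4, 0.7), (0.3, 0.2), (0.8, 0.9), (0.5, 0.4)]
curve = [nonlinear_fusion(s, d) for s, d in scores]
print(extract_key_frames(curve, num_key_frames=2))
```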
First, the results of the technique are illustrated on a single shot taken from the Open Video Project (
The first test sequence is the fifth shot (frames 484–520) of the ucomp03_06_m1.mpeg video. A tennis player strikes the ball, stands up, and receives the applause of the crowd.
The second test sequence is the second shot (frames 532–548) of the hcil2000_01.mpeg video. A subject stands and speaks in the frame under consideration, surrounded by trees. From frame 545, a caption appears in the scene to introduce the narrator. A key frame that is representative of this scene has to display both the person and the caption. The attention curves are illustrated in
In this section, the proposed system is compared with several of the outstanding schemes based on visual attention as well as on non-visual attention. For the comparison, the experiment was carried out with twenty videos of different types, downloaded directly from the Open Video Project.
Multiple approaches were used to compare the results. One is based on the well-known measures of precision, recall, and F-measure; the other is the subjective Mean Opinion Score (MOS) evaluation.
In the first assessment procedure, the key frames of each video are extracted manually by three human users. Two frames are assumed to be identical if they carry the same semantic content. The following terms are then defined:
These terms are used to define the Precision and Recall metrics.
To obtain a single combined metric, Precision and Recall are merged using the following F-measure definition:
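As an illustration, the three metrics can be computed as in the following sketch, where the semantic matching of extracted frames against the ground truth is abstracted into simple counts; the function name and the example numbers are hypothetical.

```python
def precision_recall_f(num_matched, num_extracted, num_ground_truth):
    """Standard key-frame evaluation metrics.
    num_matched: extracted key frames that match a ground-truth frame,
    num_extracted: key frames produced by the method,
    num_ground_truth: key frames selected by the human users."""
    precision = num_matched / num_extracted if num_extracted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure

# Example: 8 of 10 extracted frames match, with 9 ground-truth key frames.
print(precision_recall_f(8, 10, 9))   # -> (0.8, 0.888..., 0.842...)
```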
S.no | Name of the video | Total number of frames
---|---|---
01 | Shot 03 of 8 in Wetlands Regained | 3563
02 | A Digital Personal Scale in Technology at Home | 3345
03 | Introduction to HCIL 2000 Reports | 2453
04 | Shot 05 of 14 in Ocean Floor Legacy | 4664
05 | Shot 01 of The Great Web of Water | 3278
06 | Shot 02 of The Great Web of Water | 2117
07 | Shot 07 of The Great Web of Water | 1744
08 | Shot 01 of A New Horizon | 1805
09 | Shot 02 of A New Horizon | 1796
10 | Shot 06 of A New Horizon | 1943
This section provides a comparison between the proposed approach and four state-of-the-art non-visual-attention schemes: DT [
S.no | In [ | | | In [ | | | In [ | | | Proposed Method | |
---|---|---|---|---|---|---|---|---|---|---|---|---
 | P | R | F | P | R | F | P | R | F | P | R | F
1 | 0.75 | 0.83 | 0.79 | 0.70 | 0.82 | 0.75 | 0.80 | 0.81 | 0.80 | 0.80 | 0.84 | 0.82 |
2 | 0.75 | 0.75 | 0.75 | 0.73 | 0.73 | 0.73 | 0.85 | 0.90 | 0.87 | 0.82 | 0.85 | 0.83 |
3 | 0.70 | 0.85 | 0.77 | 0.72 | 0.80 | 0.76 | 0.82 | 0.86 | 0.84 | 0.82 | 0.83 | 0.82 |
4 | 0.85 | 0.86 | 0.85 | 0.85 | 0.83 | 0.84 | 0.80 | 0.83 | 0.81 | 0.80 | 0.83 | 0.81 |
5 | 0.75 | 0.83 | 0.79 | 0.75 | 0.80 | 0.77 | 0.83 | 0.80 | 0.81 | 0.75 | 0.82 | 0.78 |
6 | 0.73 | 0.88 | 0.79 | 0.70 | 0.83 | 0.76 | 0.75 | 0.85 | 0.80 | 0.85 | 0.90 | 0.87 |
7 | 0.75 | 0.75 | 0.75 | 0.72 | 0.83 | 0.77 | 0.70 | 0.80 | 0.75 | 0.80 | 0.81 | 0.80 |
8 | 0.75 | 0.83 | 0.79 | 0.75 | 0.81 | 0.78 | 0.72 | 0.83 | 0.77 | 0.80 | 0.85 | 0.82 |
9 | 0.82 | 0.88 | 0.84 | 0.82 | 0.85 | 0.83 | 0.78 | 0.85 | 0.81 | 0.85 | 0.87 | 0.86 |
10 | 0.80 | 0.82 | 0.81 | 0.82 | 0.80 | 0.81 | 0.80 | 0.83 | 0.81 | 0.81 | 0.83 | 0.82 |
However, several exceptions exist. For example, the DT scheme for Video 5 attains a high level of precision. Yet for that video, DT chooses only 1 key-frame, so the values of
Moreover, among the schemes based on visual attention, the outcome of the proposed technique is comparable to that of the other mechanisms. Similar findings can be drawn from
Therefore, the length of the videos to be summarized was varied from 1,000 to 6,000 frames. In the proposed methodology, an optional pre-sampling step can be used if the computational effort is to be reduced further.
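A minimal sketch of such an optional pre-sampling step is given below; the sampling rate of every fifth frame is purely illustrative.

```python
def pre_sample(frames, step=5):
    """Optional pre-sampling: keep every `step`-th frame so that the
    attention scores are computed on fewer frames."""
    return frames[::step]

# Example: reduce a 3,000-frame sequence to 600 frames before scoring.
sampled = pre_sample(list(range(3000)), step=5)
print(len(sampled))   # 600
```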
S.no | In [ | In [ | In [ | Our method
---|---|---|---|---
01 | 4.08 | 4.25 | 4.03 | 4.16 |
02 | 4.13 | 4.41 | 4.1 | 3.99 |
03 | 4.4 | 4.06 | 4 | 4.16 |
04 | 4.15 | 4.24 | 4 | 4.31 |
05 | 4.1 | 4.09 | 4.1 | 4.25 |
06 | 3.99 | 4.11 | 4.1 | 4.06 |
07 | 4.15 | 4.06 | 4.13 | 4.19 |
08 | 4 | 4.14 | 4.11 | 4.19 |
09 | 4.18 | 4.16 | 4 | 4.31 |
10 | 4.3 | 4.22 | 4.13 | 4.4 |
Method | Generated Key Frames
---|---
Ground Truth | (key-frame images)
[ | (key-frame images)
[ | (key-frame images)
[ | (key-frame images)
[ | (key-frame images)
[ | (key-frame images)
[ | (key-frame images)
Our Method | (key-frame images)
Finally, the key-frames extracted by the different schemes are displayed visually in
The proposed technique is evaluated using precision, recall, and the F-measure. The formulas used for these measurements are the same as in conventional VS techniques.
The comparison with alternative available techniques on the OV dataset is assessed with precision, recall, and the F-measure. Precision reflects the accuracy of a technique and accounts for wrongly extracted key-frames. The recall value indicates the proportion of the key-frames in the ground truth that are retrieved. We validated our technique on two benchmark video datasets by comparing our findings to prior VS techniques.
The first dataset is OV (Open Video Project) and comprises videos in standard RGB form at 30 FPS with a resolution of 352 × 240 pixels. This dataset includes different types of videos, e.g., documentaries, surveillance videos, educational videos, historical videos, ephemeral films, and lecture videos [
The second dataset contains 50 videos of various types, i.e., surveillance and sports videos, animated videos, TV home videos, and commercials, with durations of one to ten minutes. We compared the outcome with VSUMM, the five user summaries, and the method of Fei et al. [
They segment the recordings using a perceptual hashing technique, which is unsuitable for surveillance streams and offers limited performance.
Besides the quantitative measurements, it is essential to evaluate performance on the basis of a subjective, qualitative assessment.
The MOS is a subjective assessment metric used to evaluate the summarized results of various VS techniques. It directly reflects the opinion of the users and highlights their areas of concern.
Video No. | In [ | In [ | In [ | In [ | In [ | In [ | In [ | In [ | In [ | Our method
---|---|---|---|---|---|---|---|---|---|---
1 | 4.26 | 4.18 | 4.24 | 3.08 | 2.50 | 4.52 | 4.04 | 4.04 | 4.50 | 4.60 |
2 | 4.35 | 4.2 | 4.28 | 2.88 | 3.62 | 4.12 | 3.86 | 4.47 | 4.49 | 4.50 |
3 | 4.24 | 4.05 | 4.3 | 2.42 | 3.82 | 3.48 | 4.19 | 4.06 | 4.22 | 4.50 |
4 | 4.16 | 4.6 | 4.28 | 2.76 | 3.20 | 2.71 | 3.40 | 3.91 | 4.22 | 4.50
5 | 4.26 | 4.08 | 4.66 | 3.08 | 3.37 | 2.72 | 4.02 | 4.08 | 4.44 | 4.45 |
6 | 4.15 | 4.28 | 4.39 | 3.82 | 3.64 | 3.30 | 4.47 | 4.39 | 4.51 | 4.60 |
7 | 4.27 | 4.26 | 4.25 | 3.50 | 3.68 | 3.57 | 3.82 | 4.25 | 4.35 | 4.40 |
8 | 4.15 | 4.2 | 4.15 | 3.50 | 3.83 | 3.56 | 2.80 | 4.28 | 4.48 | 4.19 |
9 | 4.07 | 4.36 | 4.6 | 3.38 | 3.12 | 3.14 | 3.29 | 4 | 4.18 | 4.19 |
10 | 4.28 | 4.17 | 4.2 | 3.36 | 3.46 | 3.44 | 3.84 | 4.14 | 4.02 | 4.05 |
 | In [ | In [ | In [ | In [ | In [ | In [ | In [ | In [ | Proposed Method
---|---|---|---|---|---|---|---|---|---
Attention model | Motion attention model | Visual attention model | Visual attention model | Color features | Visual saliency | Motion attention model | Deep features | Object motion | Deep features
Fusion scheme | Linear | Linear | Linear | None | Non-linear | None | Hierarchical | None | Non-linear
 | × | ✓ | ✓ | × | ✓ | × | ✓ | × | ✓
 | × | × | × | ✓ | × | × | ✓ | ✓ | ✓
 | × | × | × | × | ✓ | × | ✓ | × | ✓
 | × | × | × | × | × | ✓ | ✓ | × | ✓
Computational complexity is an essential measure for assessing VS techniques, particularly for surveillance video gathered in resource-constrained environments. With this in mind, we measured the running time of our strategy and compared its complexity with that of similar methods. For this purpose, we considered various videos with 1,000 to 6,000 frames. The mean running times of representative VS techniques are 304.61, 249.27, 277.84 and 123.97 s for [
In this paper, we propose an effective framework that relies on visual attention for KFE from videos. The method not only delivers effective outcomes but is also appropriate for use on small devices. Using temporal gradients offers an effective substitute for the traditional optical-flow-based features used so far. Using a non-linear weighted fusion scheme combines the advantages of the previously used schemes. In general, the framework requires far less time than recent schemes based on visual attention. The experimental outcomes, based on a set of criteria, indicate that the key-frames extracted by the proposed scheme are semantically relevant and more strongly focused on the highlights than those produced by the alternative methods against which it was evaluated.
This work was supported in part by the Qatar National Library, Doha, Qatar, and in part by the Qatar University Internal Grant under Grant IRCC-2021-010.