The Application of Sparse Reconstruction Algorithm for Improving Background Dictionary in Visual Saliency Detection



ABSTRACT
In this paper, we apply a sparse reconstruction algorithm with an improved background dictionary to saliency detection. First, after super-pixel segmentation, two low-level features are extracted: the color information in the LAB color space and the texture features of the image obtained with the Gabor filter. Second, convex hull theory is used to remove the object region from the boundary region, and the K-means clustering algorithm is used to further simplify the background dictionary. Finally, the saliency map is obtained by calculating the reconstruction error. Compared with mainstream algorithms, the proposed algorithm achieves better accuracy and efficiency.

INTRODUCTION
Visual saliency detection aims to enable computers to detect visual attention areas as quickly and accurately as the human visual system. If a computer can accurately detect the visually salient area, it can allocate computing resources effectively, focus on that area, and greatly improve the efficiency of resource utilization. Saliency detection originates from the human visual attention mechanism and is dedicated to finding the regions that are most interesting to the human eye. Humans can easily identify the significant areas of an image and notice its important parts. If a person focuses on an area of an image within a very short time, that area is the salient area of the image. Each image contains one or more salient objects, and saliency detection imitates the visual attention mechanism to obtain the important information.
In image target detection, image features are usually extracted and then matched against the corresponding template to determine the target to be detected. If the visually salient region can be determined beforehand, features can be sampled directly within it. In this way, the interference of non-target regions with feature extraction is reduced, and the accuracy of important image processing tasks such as target detection, recognition and classification is improved. If visual saliency detection is added to a device, its versatility will increase, and it will have broad application prospects in the field of intelligent devices.
Koch (1985) put forward the concept of the saliency map and proposed using a binary image to represent it. This model has had a far-reaching impact on subsequent research. The Itti model (1998) was based on the human visual cognitive system and simulated human perception of color, brightness and direction information. The model used the difference between this low-level image information and the surrounding background area to measure saliency. The graph-based saliency detection method (2006) used the framework and basic principles of Itti's model. The Markov random field model was used to simulate the complex attention transfer process in perceiving image information. This method correlated strongly with the visual attention mechanism and was a very representative saliency detection method. The method proposed by Hou et al. (2007) calculated saliency maps from the frequency domain perspective according to the idea of the residual spectrum. First, the statistical characteristics of the Fourier spectrum of the natural background were used, and then the saliency map was calculated by the inverse Fourier transform of the residual spectrum. This method was too simple to detect small targets. Because salient objects tend to be sparse in natural scenes, sparse representation has also been used in saliency detection. A context-based model (2014) successfully applied an iterative multi-scale cumulative process to visual saliency detection. Huo et al. (2016) used global cues and local objective cues to calculate the saliency of the whole image and of the candidate objects respectively; their multi-cue fusion method highlights the saliency of the image and suppresses interference.
There are two innovations in this paper. First, texture is a basic feature that the human visual system can perceive directly. The human visual mechanism resembles multi-channel filtering, so the Gabor filter is used to extract texture features. The Gabor amplitude features are transformed into a single strength value by principal component analysis, which is then combined with the LAB color space and the position space to form a feature vector. Second, the background dictionary is constructed from a single image by using the background prior, but foreground regions are often included in the boundary super-pixels. In this paper, the foreground area is removed first, and the remaining boundary super-pixels are used as the background dictionary. The redundancy in the background dictionary can then be further reduced, and the super-pixels in the template become more discriminative, so that the foreground of the image is prominent.

COLOR FEATURE INFORMATION
The RGB color space is used in most display devices, but it has some shortcomings in describing color: (1) A color is a linear combination of R, G and B, so it is difficult to express different colors with precise values. (2) The linear combination of R, G and B results in high correlation among the components. If the brightness of a natural image changes, all three components change accordingly. (3) The uniformity of the RGB color space is poor, and the human eye is not equally sensitive to the three color components. This leads to a large deviation between the color representation and human vision.
The LAB color space is based on the second-stage encoding of color in human vision. Any value in the Lab color space represents a color, and the value is independent of the device used. The LAB color space forms signals in three new channels: black-white (B-W), red-green (R-G) and yellow-blue (Y-B). The B-W channel is named the L channel, with a value range of [0, 255], and the range of the R-G and Y-B channels is [-127, 128]. The RGB and CIEXYZ color spaces are complex and not intuitive to use, whereas the CIELAB color space contains almost all the colors that humans can perceive. XYZ serves as the intermediate quantity that links RGB and Lab:

$$L = 116\, f\!\left(\frac{Y}{Y_n}\right) - 16, \qquad a = 500\left[f\!\left(\frac{X}{X_n}\right) - f\!\left(\frac{Y}{Y_n}\right)\right], \qquad b = 200\left[f\!\left(\frac{Y}{Y_n}\right) - f\!\left(\frac{Z}{Z_n}\right)\right] \tag{1}$$

$$f(t) = \begin{cases} t^{1/3}, & t > (6/29)^3 \\ \frac{1}{3}\left(\frac{29}{6}\right)^2 t + \frac{4}{29}, & \text{otherwise} \end{cases}$$

where $L$, $a$, $b$, $X$, $Y$, $Z$ are the values in their respective color spaces, and $X_n$, $Y_n$, $Z_n$ are generally 95.047, 100.0 and 108.883.
Since the color information in the Lab color space is contained in the a and b components, the color space conversion makes it easier to cluster the color blocks of the image in the subsequent k-means clustering analysis.
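The XYZ-to-Lab step can be illustrated with a minimal sketch. It assumes the standard CIE conversion with the reference white quoted above; the function name `xyz_to_lab` is hypothetical, and the output uses the unscaled CIE convention (L in [0, 100]), so any rescaling to the 8-bit ranges mentioned in the text would be a separate step.

```python
import numpy as np

# Reference white (X_n, Y_n, Z_n) as quoted in the text.
WHITE = np.array([95.047, 100.0, 108.883])

def xyz_to_lab(xyz, white=WHITE):
    """Convert CIE XYZ values (same scale as `white`) to CIE L*a*b*.

    `xyz` may be a single triple or an array whose last axis has length 3.
    """
    t = np.asarray(xyz, dtype=float) / white
    delta = 6.0 / 29.0
    # Cube root above the threshold, linear segment below it.
    f = np.where(t > delta ** 3, np.cbrt(t), t / (3.0 * delta ** 2) + 4.0 / 29.0)
    fx, fy, fz = f[..., 0], f[..., 1], f[..., 2]
    L = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return np.stack([L, a, b], axis=-1)
```

By construction, the reference white maps to L = 100 with a = b = 0, and black maps to the origin.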

The Gabor Filter
The Gabor wavelet transform simulates the human visual system. It magnifies image features (local texture information) and measures direction and frequency characteristics in the spatial and frequency domains. In image processing, the Gabor wavelet transform can not only carry out multi-scale and multi-directional analysis to capture the local structure information of the image, but is also robust to illumination change and image deformation. Moreover, the Gabor wavelet has low sensitivity to light changes and adapts well to uneven illumination in the image. These characteristics explain why the Gabor wavelet is widely used in vision systems, and they make the two-dimensional Gabor filter a tool for the multi-scale and multi-direction representation of images. It achieves optimal joint localization of the signal in the spatial and frequency domains, and the spatial position, frequency, direction, phase and bandwidth of the two-dimensional signal can be obtained from the image processed by the two-dimensional Gabor filter. Texture features are filtered as approximately periodic signals. The Gabor filter can extract high-quality local information and frequency domain information of targets. In its standard complex form, it is expressed as follows:

$$g(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\exp\!\left(i\,\frac{2\pi x'}{\lambda}\right), \qquad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta$$

where $\lambda$ represents the wavelength of the Gabor function, and $\theta$ represents the filtering angle of the filter. $\sigma$ is a constant, usually set to half the size of the filter, and $\gamma$ denotes the aspect ratio and determines the ellipticity of the Gabor function shape.
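As an illustration of the parameters just described, the following sketch builds one complex Gabor kernel. It assumes the standard form (a Gaussian envelope times a complex sinusoid along the rotated axis); the function name and argument names are hypothetical, and no phase offset is included.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma):
    """Complex 2-D Gabor kernel of shape (size, size); `size` is assumed odd.

    wavelength: wavelength of the sinusoidal carrier
    theta:      filtering angle in radians
    sigma:      width of the Gaussian envelope
    gamma:      aspect ratio (ellipticity of the envelope)
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinates by the filter orientation.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(2j * np.pi * xr / wavelength)
    return envelope * carrier
```

The real and imaginary parts of the returned array correspond to the even and odd kernels, and `np.abs` of a filtered image gives the amplitude feature used later in the text.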

The Gabor Filter Parameter Selection
The two-dimensional Gabor filter is a band-pass filter. It describes local information about spatial location, scale selectivity and direction selectivity with high quality. Two-dimensional Gabor filters generate a family of two-dimensional Gabor wavelets through scaling and rotation, which can extract the texture features of images. The selection of the filter parameters is the key to feature extraction. For this reason, we use Gabor filter banks with multiple center frequencies and directions to describe the image. As described above, the effect of the two-dimensional Gabor filter is determined by its direction and scale parameters. In this paper, the Gabor filter is applied to obtain the response maps in the selected directions. According to theoretical analysis, the more filtering directions or scale levels are selected, the better the final effect will be. However, the number of filters increases accordingly, and the real-time performance is reduced.
In order to show more intuitively that the Gabor filter can effectively extract the texture features of an image in different directions, this paper selects images with obvious texture directions from the Brodatz texture library. Figure 2 shows the extraction results of the texture features in their respective directions.
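The direction selectivity of a filter bank can be demonstrated with a small sketch. Everything here is an assumption made for illustration: the helper names, the choice of sigma as half the wavelength, the aspect ratio of 0.5, and the use of circular FFT convolution. The source does not state its scale and direction values; four scales times six directions is simply one combination that yields the 24 Gabor features mentioned in the text.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Complex Gabor kernel of shape (size, size); `size` is assumed odd."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(2j * np.pi * xr / wavelength)

def gabor_bank(size, wavelengths, n_orient):
    """List of (wavelength, theta, kernel), one entry per scale and direction."""
    bank = []
    for lam in wavelengths:
        for k in range(n_orient):
            theta = np.pi * k / n_orient
            # Assumption: the envelope width is tied to the wavelength.
            bank.append((lam, theta, gabor_kernel(size, lam, theta, 0.5 * lam)))
    return bank

def amplitude_response(image, kernel):
    """Amplitude of the circular (FFT-based) convolution of image and kernel."""
    padded = np.zeros(image.shape, dtype=complex)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    # Move the kernel center to the origin so the output is not shifted.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.abs(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def dominant_direction(image, bank):
    """Return the (wavelength, theta) whose mean amplitude response is largest."""
    scores = [amplitude_response(image, k).mean() for _, _, k in bank]
    lam, theta, _ = bank[int(np.argmax(scores))]
    return lam, theta
```

For a synthetic image of vertical stripes with period 8, the filter with wavelength 8 and angle 0 responds most strongly, which mirrors the qualitative behaviour described for the Brodatz textures.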

From the Gabor Amplitude Image to the Gabor Characteristics
(1) Each Gabor amplitude image contains some local variations. Simple Gaussian low-pass filtering can be used to compensate for these variations and smooth the Gabor amplitude information.
(2) The feature set also includes spatial information, so that each pixel is described not only by its filter responses but also by its position in the image plane. Each filter in the Gabor filter bank contributes one feature, and the spatial information contributes two additional features. In total, each pixel in the input image has 24 Gabor features and 2 spatial features.
(3) The features of each pixel are normalized. In order to visualize the features and simplify the calculation, the multi-dimensional array representing the features of each pixel is transformed into a single value by principal component analysis.
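Step (3) can be sketched as follows. This is a minimal illustration under assumptions: the features are standardized to zero mean and unit variance per column, and the single value per pixel is taken to be the projection on the first principal component, computed with an SVD. The function name is hypothetical.

```python
import numpy as np

def pca_first_component(features):
    """Reduce (n_pixels, n_features) to one value per pixel.

    Columns are standardized, then projected on the first principal axis.
    The sign of the result is arbitrary, as with any PCA.
    """
    X = np.asarray(features, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features at zero after centering
    Z = (X - X.mean(axis=0)) / std
    # The rows of vt are the principal axes of the standardized data.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[0]
```

With 24 Gabor features and 2 spatial features per pixel, the input would have 26 columns, and the output can be reshaped to the image size for display.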

The Mathematical Form of the Sparse Reconstruction
The standard sparse coding model is used to simulate human nerve cells. Each element in the image is represented as a linear combination of a set of dictionary basis functions. Let $x$ represent the feature descriptor of a sample and $U$ represent the dictionary of an image. For a dictionary, it is generally required that the number of bases be greater than the dimension of the sample elements. Under these conditions, the coding $\upsilon$ of the sample is as follows:

$$\upsilon^{*} = \arg\min_{\upsilon} \lVert x - U\upsilon \rVert_2^2 + \lambda \lVert \upsilon \rVert_0 \tag{6}$$

where $\lambda$ is the balance factor between the two terms.
Each column of $U$ is a basis of the dictionary. $\lVert\cdot\rVert_2$ represents the 2-norm, and $\lVert\cdot\rVert_0$ denotes the zero norm. The first term of the formula describes the reconstruction error between the original image block and the reconstructed image block. If its value is small, the image block reconstructed from the basis functions and the coding coefficients approximates the original image block well. This term describes the quality of the coding result. The second term is the sparsity term, which represents the degree of sparsity of the coding. The value of $\lambda$ determines the sparsity of the coding $\upsilon$.
Suppose an image is divided into n super-pixels, which are then grouped into S segments. The super-pixels at the image boundary are used as the dictionary, and the sparse coding $\upsilon_i$ of the i-th super-pixel $x_i$ is calculated by formula (6). The reconstruction error of this super-pixel is as follows:

$$\varepsilon_i = \lVert x_i - U\upsilon_i \rVert_2^2 \tag{7}$$

The algorithm uses the sparse reconstruction error to solve the salient target detection problem. The basic idea is that, when the background template is chosen as the set of base vectors in the reconstruction process, the reconstruction error of each image region is proportional to the saliency value of that region. For equation (6), a series of sparse coefficients is solved to minimize the corresponding reconstruction error of equation (7). The reconstruction error gives the saliency score of the target area, which assists the detection of salient objects and suppresses the foreground noise. In this process, every region is generated by SLIC segmentation, and its feature representation $x$ is extracted. Then the image boundary area is selected as the background template B, and a sparse coding framework is established on this basis. For the i-th region, the background template B is used as the set of base vectors for the sparse representation according to formula (6), and the sparse reconstruction error of each region is then calculated according to formula (7). If the reconstruction error of a region computed with the background template is small, the region resembles the background and is not salient. For the same set of background templates, the reconstruction errors calculated for object regions and background regions are generally quite different. Even for a relatively complex background, the sparse representation can express this difference. In this paper, the sparse reconstruction error is used to decide whether a super-pixel region is a salient region and to quantify its saliency value.
The larger the reconstruction error of a super-pixel, the more likely it is to be part of the salient target, and the more salient its segment tends to be. If some foreground segments are grouped into the background (for example, when salient objects appear at the edge of the image), the salient regions will be misdetected as background.
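The link between the background dictionary and the reconstruction error can be sketched in a few lines. Note the substitution: instead of solving the penalized zero-norm problem of formula (6) directly, this sketch uses orthogonal matching pursuit (OMP), a greedy method that limits the number of non-zero coefficients. The function names, the `n_nonzero` parameter and the assumption of unit-norm dictionary columns are illustrative and not taken from the paper.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy sparse code of x over the columns of D (assumed unit norm)."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []
    coef = np.zeros(0)
    code = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # no new atom helps; stop early
        support.append(j)
        # Least-squares refit on all selected atoms.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    code[support] = coef
    return code

def reconstruction_error(D, x, n_nonzero=3):
    """Squared 2-norm of x minus its sparse reconstruction, as in formula (7)."""
    x = np.asarray(x, dtype=float)
    code = omp(D, x, n_nonzero)
    return float(np.sum((x - D @ code) ** 2))
```

A feature vector that lies in the span of a few background atoms gets an error near zero, while a vector with a component outside the background span keeps a large error, which is the behaviour the saliency score relies on.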

The Selection of the Dictionary
In this section, the color-enhanced Harris corner operator is used to detect the corners of potential target areas in the image. These salient points confirm the spatial location of the salient targets. The salient points are enclosed in a convex hull to estimate the salient target region. Since salient objects may appear at the edge of an image, the accuracy of saliency detection is reduced if all the edge regions of the image are used as the background dictionary. Therefore, the part of the boundary area that belongs to the estimated target region is removed, and the remaining boundary area is used as the background dictionary. This guarantees more stable and reliable background information. The redundancy in the background dictionary can be further reduced by clustering the boundary super-pixels.
In this paper, each super-pixel contains many pixels, and the average CIELAB color features, Gabor texture features and location information of these pixels constitute the feature of the super-pixel. The K super-pixels are recorded as $S_K$. The background template is constructed by selecting the super-pixels at the image boundary from $S_K$. Although the background template is still used as the dictionary for the sparse representation, it is preprocessed before the sparse reconstruction. The image boundary can be divided into four directions: upper, lower, left and right. Because the super-pixel features on each boundary may be very similar, the super-pixels in an unprocessed background template are not sufficiently distinct from one another. This leads to redundancy in the sparse dictionary, so this paper preprocesses the background template to reduce it. The background templates are classified into four categories according to the four directions. The similar super-pixels of each boundary are merged by the k-means algorithm, and each merged super-pixel is represented by the feature mean of the super-pixels it contains. The new background template is recorded as $B$, and $b_j$ represents the j-th feature vector in the sparse dictionary. The basic idea of sparse representation is to represent as much data as possible with as few dictionary features as possible. After preprocessing, the super-pixels in the background template are more discriminative. Reducing the number of super-pixel features in the template reduces the number of sparse dictionary features, which effectively solves the redundancy problem of the dictionary and improves the efficiency of the sparse reconstruction error calculation.
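The selection of boundary super-pixels by direction can be sketched as follows. The sketch assumes a label map in which each pixel stores the index of its super-pixel, and an optional boolean mask marking the estimated target region (for example, the filled convex hull of the detected corners). Both inputs and the function name are hypothetical; corner detection and hull construction are outside the scope of this example.

```python
import numpy as np

def boundary_superpixels(labels, foreground_mask=None):
    """Group the super-pixels touching the image border by side.

    labels:          (H, W) integer array of super-pixel indices
    foreground_mask: optional (H, W) boolean array; any super-pixel that
                     overlaps it is dropped from the background candidates
    Returns a dict mapping side name to a sorted list of super-pixel ids.
    """
    labels = np.asarray(labels)
    sides = {
        "top": labels[0, :],
        "bottom": labels[-1, :],
        "left": labels[:, 0],
        "right": labels[:, -1],
    }
    excluded = set()
    if foreground_mask is not None:
        mask = np.asarray(foreground_mask, dtype=bool)
        excluded = set(np.unique(labels[mask]).tolist())
    return {side: sorted(set(np.unique(ids).tolist()) - excluded)
            for side, ids in sides.items()}
```

The four lists returned here correspond to the four categories of background templates, each of which would then be clustered separately.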

The K-means Algorithm
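Because it is unsupervised and fast, the K-means algorithm is widely used. The algorithm groups the data into different clusters according to the distances between the data points.
Given an observation set $(x_1, x_2, \ldots, x_n)$, where each observation is a multi-dimensional vector, the observations are clustered into K sets $S = \{S_1, \ldots, S_K\}$ so that each sample is as close as possible to the center of its class, as expressed by formula (8):

$$\arg\min_{S} \sum_{j=1}^{K} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2 \tag{8}$$

The algorithm alternates between two steps.
1. Assign each observation to a class. The class $c_i^{(t)}$ to which sample $x_i$ should belong is calculated as shown in formula (9):

$$c_i^{(t)} = \arg\min_{j} \lVert x_i - \mu_j^{(t)} \rVert^2, \qquad S_j^{(t)} = \{x_i : c_i^{(t)} = j\} \tag{9}$$

2. Recalculate the centroid of each class j by formula (10):

$$\mu_j^{(t+1)} = \frac{1}{\lvert S_j^{(t)} \rvert} \sum_{x_i \in S_j^{(t)}} x_i \tag{10}$$

Each initial center moves to the mean position of the data assigned to it. After several iterations, when the centers no longer change, the cluster centers are obtained.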

The K-means algorithm
Input: number of clusters K, and N data samples
Output: K clusters satisfying the minimum standard deviation
1. Randomly select K objects from the N samples as the initial cluster centers.
2. While the clustering changes:
3. Use formula (10) to calculate each cluster center.
4. Use formula (9) to calculate the distance between each sample and each cluster center, and re-classify the samples according to the minimum distance.
5. End.
The k-means algorithm terminates when formula (9) attains its minimum value. Because the image features are correlated, the saliency details are retained and the saliency information is extracted in the process of feature clustering.
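The listed steps can be sketched in NumPy as follows. This is a plain Lloyd-style implementation written for illustration, with a hypothetical function name and a fixed random seed; the returned centers play the role of the merged super-pixel features (the feature means of each cluster).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment: each sample goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # the clustering no longer changes
        labels = new_labels
        # Update: each center moves to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels
```

Applied to the feature vectors of the super-pixels on one image boundary, the `centers` array would give the reduced set of dictionary atoms for that boundary.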

EXPERIMENT
In this paper, the super-pixel is used as the basic processing unit, and the SLIC method (2010) is used to simplify the image structure information and preserve the edges of the salient objects. Each image block is represented by the average color feature, texture feature value and position coordinates of all the pixels contained in the super-pixel. The feature of each super-pixel is thus composed of its color features, texture feature and location. The image is represented by the N super-pixel features, where D denotes the feature dimension.
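Averaging per-pixel features over each super-pixel can be sketched as below. The sketch assumes a precomputed label map (for example from SLIC) and a per-pixel feature array such as Lab color plus one texture value; the normalized pixel coordinates are appended as the location part. The function name and the coordinate normalization are illustrative choices.

```python
import numpy as np

def superpixel_features(pixel_features, labels):
    """Mean feature and mean position per super-pixel.

    pixel_features: (H, W, C) array of per-pixel features
    labels:         (H, W) integer array with values 0..N-1, every value used
    Returns an (N, C + 2) array: C mean features, then mean x and mean y,
    with coordinates normalized to [0, 1].
    """
    feats = np.asarray(pixel_features, dtype=float)
    labels = np.asarray(labels)
    H, W, C = feats.shape
    n = int(labels.max()) + 1
    ys, xs = np.mgrid[0:H, 0:W]
    full = np.concatenate(
        [feats.reshape(-1, C),
         xs.reshape(-1, 1) / max(W - 1, 1),
         ys.reshape(-1, 1) / max(H - 1, 1)],
        axis=1)
    flat = labels.reshape(-1)
    counts = np.bincount(flat, minlength=n).astype(float)
    # Sum each feature column per label, then divide by the pixel count.
    sums = np.stack(
        [np.bincount(flat, weights=full[:, d], minlength=n)
         for d in range(full.shape[1])],
        axis=1)
    return sums / counts[:, None]
```

With three Lab channels and one texture value per pixel, the result would have D = 6 columns per super-pixel.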

The Algorithm Comparison and Analysis
In this paper, the MSRA-1000 data set is used for the experiment. The MSRA dataset is provided by Microsoft Research Asia. Its subset is one of the benchmark datasets widely used by many saliency detection algorithms. The database was first compiled by Liu et al. (2012). Most of the images in this database have a resolution of 400×300, and the saliency region is labelled with a rectangular box. Cheng et al. (2014) simplified the MSRA data set by randomly selecting 10,000 images with precise manual labelling; the result is referred to as MSRA-10K. The images in the data set (Figure 5) are characterized by complex backgrounds and low contrast in the target area. Several representative methods were chosen for testing on the data set. Figure 6 shows the experimental results of several classical algorithms (IT, GB, SR, FT, CA, RC) for comparison. By comparing the results with the manually annotated binary images, we can see visually whether the algorithm improves the accuracy of saliency detection. The validity of the algorithm can be verified quantitatively by the recall rate and the F-measure. Figure 4 intuitively shows the results of the algorithm from a qualitative point of view. Section 5.2 selects several classical indicators to compare the results from a quantitative point of view.

The Performance Comparison and Analysis
In Figure 4, the sample image is binarized, and the resulting binary image is used as the basis for the saliency region. Precision (P), also called accuracy in this paper, denotes the proportion of truly salient pixels among all the pixels that the saliency detection algorithm marks as salient. The recall rate (R) denotes the degree to which the salient pixels detected by the algorithm agree with the salient region in the manually labelled ground truth (GT). The F-measure is a weighted harmonic mean of precision and recall, which reflects the overall quality of saliency detection. In its commonly used form, the formula is as follows:

$$F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$$

where $\beta^2$ weights precision against recall. The threshold is varied over its range, and dynamic threshold segmentation of the saliency map is carried out to draw the PR curve. Figures 6 and 7 show the obtained accuracy and recall curves. Because both high accuracy and a high recall rate are desired, the F-measure is used as the overall performance measure.
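The three indicators can be computed from a saliency map and a binary ground truth with a short sketch. The function names are hypothetical. The default weight of 0.3 for the squared beta term is a value often used in the saliency literature and is an assumption here, as is the 8-bit threshold range 0 to 255 used for the curve; the source states neither.

```python
import numpy as np

def precision_recall_f(saliency, gt, threshold, beta2=0.3):
    """Precision, recall and F-measure of a saliency map at one threshold."""
    pred = np.asarray(saliency) >= threshold
    truth = np.asarray(gt).astype(bool)
    tp = np.logical_and(pred, truth).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / truth.sum() if truth.sum() else 0.0
    denom = beta2 * precision + recall
    f = (1.0 + beta2) * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f)

def pr_curve(saliency, gt, thresholds=range(256)):
    """List of (precision, recall) pairs, one per threshold."""
    return [precision_recall_f(saliency, gt, t)[:2] for t in thresholds]
```

Plotting recall against precision for the pairs returned by `pr_curve` gives a precision-recall curve for one image; averaging over a data set gives curves of the kind compared in the figures.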
The three indicators are averaged over the experimental sample images for statistical analysis, and their mean values are 85.92%, 76.83% and 0.8526.
The three performance indicators of the binary images under their respective thresholds were obtained by experiments on 1000 images in the MSRA test database. Figures 6 to 8 show the three performance indicators of each algorithm.
Considering the above three objective evaluation criteria, the proposed algorithm is superior to the other algorithms. As seen from Figure 5, under the same recall rate, the accuracy of this algorithm is higher than that of the other algorithms. The false positive rate is also a measurement index. By this index, too, the algorithm is superior to the other methods, as shown in Figure 6. In Figure 7, the accuracy of the RC method is not much different from that of the proposed algorithm, but its recall rate is low. Although the RC method enhances the saliency maps of some objects, the proposed algorithm achieves a higher recall rate than the RC method while maintaining high accuracy. The proposed algorithm significantly improves the detection results for the salient areas. It is more consistent with human visual perception and can highlight the salient areas.
The robustness of the algorithm is also an important index. The box plot demonstrates the robustness of the algorithm and is introduced as one of the evaluation criteria in this paper. By identifying the outliers, it directly reflects the symmetry, dispersion and anomalies of the data, and facilitates the comparison of multiple groups of data.

CONCLUSION
This paper introduces a sparse reconstruction algorithm with an improved background dictionary. Considering the correlation of the information inside the image, the boundary super-pixels are clustered according to their feature vectors. The new boundary super-pixels are used as the base template to calculate the sparse reconstruction error. The algorithm further improves the saliency detection indicators and reduces the interference of the background area with feature extraction. According to the experimental comparison and data analysis, the saliency detection method based on clustering analysis performs better than the other saliency detection methods listed in this paper. In the next stage of the research, we plan to combine our approach with top-down techniques to improve the efficiency and accuracy of saliency detection.

DISCLOSURE STATEMENT
No potential conflict of interest was reported by the authors.

Figure 1. The Flow Chart of the Algorithm.
Figure 1 shows the frequency real part, imaginary part, amplitude and phase information of the Gabor kernel function (2008).

Figure 2. The Real Part of the Gabor Filter: the Kernel Frequency Real Part, Imaginary Part and Feature Images.

Figure 3. The Amplitude of the Gabor Filter.
Figures 2 and 3 are the real part and the amplitude of the Gabor feature images. The two-dimensional Gabor filter has unique properties of spatial locality, spatial frequency and directional selectivity. Each Gabor kernel function can extract significant edge and local texture features of texture images at different frequencies and directions.

Figure 4. The Texture Feature Information Extraction in Different Directions.

Figure 5. The MSRA-10K Original and Manual Labelling Maps.

Figure 7. The Precision-Recall Curves of the Saliency Detection Algorithms.

Figure 8. The Receiver Operating Characteristic Curves of the Saliency Detection Algorithms.

Figure 9. The F-measure Bars of the Saliency Detection Algorithms.