PGCA-Net: Progressively Aggregating Hierarchical Features with the Pyramid Guided Channel Attention for Saliency Detection

ABSTRACT Salient object detection aims to segment the most visually distinctive objects in an image, which is a challenging task in computer vision. In this paper, we present PGCA-Net, which is equipped with the pyramid guided channel attention fusion block (PGCAFB), for the saliency detection task. Given an input image, hierarchical features are extracted using a deep convolutional neural network (DCNN). Then, starting from the highest-level semantic features, we restore the spatial saliency details stage by stage by aggregating the lower-level detailed features. Because the shallow detailed features have weak discriminative ability, directly introducing them into the semantic features leads only to sub-optimal results. We therefore adopt a novel pyramid channel attention mechanism to attend to the useful channels of the shallow detailed features before aggregation. The experimental results show that our proposed method outperforms its competitors on five benchmark test sets.


INTRODUCTION
The goal of salient object detection is to highlight the most visually attractive regions in an image. Detecting salient objects contributes to many applications, such as semantic segmentation with weak supervision (Lai et al., 2017) and visual object tracking (Hong et al., 2015). However, saliency detection is very challenging, because finding salient objects requires a high-level understanding of the whole image, while the salient regions must also be segmented accurately.
The traditional methods (Liu et al., 2011; Yang et al., 2013; Zhu et al., 2014) for saliency detection rely on hand-crafted features, which are related to contrast, texture and color, or on heuristic priors to detect the salient objects in an image. Since saliency detection requires a high-level understanding of the input image, these hand-crafted features or priors often fail because of their weak representation ability, which limits the practicality of saliency detection.
In recent years, following the great success of deep convolutional neural networks (DCNNs) on mainstream computer vision tasks such as object recognition (He et al., 2016), semantic segmentation (Long et al., 2015) and object detection, many deep-learning-based approaches to saliency detection have been explored. Many methods (Zhao et al., 2015; Li et al., 2015; Long et al., 2015; Li et al., 2015) were proposed to generate accurate detection results. Although deep convolutional neural networks understand images at a much higher level than hand-crafted features or priors, the saliency detection results, the so-called saliency maps, are very coarse, because the fine saliency details are lost during the down-sampling in fully convolutional networks (FCNs). Later, Hou et al. (2017) and Wang et al. (2017) combined the semantic coarse features with the detailed features to exploit their complementary information and predict fine saliency maps. However, because of the gap in discriminative ability between features at different levels, directly fusing them leads only to sub-optimal performance. The detailed features tend to have high activation values on non-salient regions, because neurons in the low levels of the CNN have small receptive fields. Combining such noisy features with the semantic coarse features introduces many non-salient responses, which further degrades the detection accuracy.
In this paper, we propose PGCA-Net with a novel pyramid guided channel attention mechanism to detect salient regions in input images more accurately by effectively aggregating multi-level features that span different stages of the CNN. First, we use a deep CNN to extract hierarchical features from an input image. Then we take the highest-level features as the initial aggregated feature and, stage by stage, selectively aggregate the lower-level features with the pyramid guided channel attention mechanism to add more salient details to the current aggregated features. Features at different levels differ in their discriminative ability for salient objects. While the low-level features contain abundant saliency details, their discriminative ability is weak because of their small receptive fields. Directly adding the shallow detailed features to the refined semantic coarse features may produce strong activations on non-salient regions, which degrades the saliency detection performance. To solve this issue, we adopt the channel attention mechanism to reweight the channels of the detailed features, highlighting the most discriminative channels and suppressing the noisy ones. While the original channel attention mechanism is based on globally average-pooled features, we extend it to a pyramid version that also considers multi-scale global features, so that the high-level discriminative features are better exploited as guidance for the attention. With the pyramid guided channel attention mechanism, we can effectively integrate the complementary information of the multi-level features and generate accurate saliency detection results. To strengthen the learning signal for saliency detection, we adopt the deep supervision strategy, which imposes a supervision signal on the network at each feature aggregation step.
To verify the effectiveness of PGCA-Net, we evaluate it on five benchmark saliency detection datasets and compare it with 16 state-of-the-art methods. The results show that our method consistently outperforms the competitors on the five benchmarks. Our contributions are as follows:
First, we develop a pyramid guided channel attention fusion block (PGCAFB) that more completely utilizes the multi-scale abstract coarse saliency features to select the channels of the shallow detailed features and to add the saliency edge information to the semantic coarse features. The PGCAFB introduces only a small number of extra parameters and little computation cost but leads to a considerable performance gain.
Second, we present PGCA-Net, which is equipped with the pyramid guided channel attention fusion block, for saliency detection.
Third, we carefully validate the presented network on five benchmark datasets. Our method achieves the best performance on all five datasets compared with the competitors, setting a new state of the art.

RELATED WORK

Saliency Detection
Here we discuss the related methods for single-RGB-image saliency detection, which fall into two categories: methods based on hand-designed priors and modern methods based on deep learning.
The traditional saliency detection methods use manually designed shallow features or priors, such as color, contrast and texture, to detect the salient objects in an image. Such low-level hand-designed methods lack a high-level understanding of images and do not generalize well to varied real-world cases. A more detailed summary of the traditional methods can be found in the survey by Borji et al. (2015).
In recent years, the community has started to apply deep convolutional neural networks (DCNNs) to saliency detection. Wang et al. (2016) designed a recurrent prediction network, which takes the saliency detection result of the previous prediction stage as a prior to predict a better saliency map. Zhang et al. (2017) proposed a new dropout technique for detecting saliency on unseen objects. However, these early DCNN-based methods exploited only the deep semantic coarse features of the CNNs and neglected the saliency detail information, which is usually carried by the detailed features from the shallow layers of the CNNs.
Recently, several methods were proposed to further enhance the saliency detection results by exploiting the detailed features from the shallow layers of the CNN, together with mechanisms designed to utilize the complementarity of the detailed features and the semantic features. Hou et al. (2017) proposed a network with deep supervision and detailed features for saliency detection. Wang et al. (2017) used a stage-wise mechanism that absorbs the high-resolution detailed features to progressively enhance the saliency detection results over several stages. However, such methods exploit the low-level detailed features without selection or filtering, which leads to sub-optimal detection results. We introduce our pyramid guided channel attention mechanism to better highlight the beneficial shallow detailed features and suppress the noisy, harmful ones.

The Channel Attention Mechanism
Attention is commonly used in many areas of computer vision because of its strong performance. Recently, Hu et al. (2018) proposed a very lightweight but powerful channel attention mechanism and integrated it into each residual block of the ResNet (He et al., 2016) to reach state-of-the-art image recognition accuracy. The channel attention mechanism takes a feature map as input, applies the squeeze operation to reduce it to a 1 × 1 feature map, and uses two fully connected layers to transform it into an attention vector. The attention vector is then used to highlight or suppress the individual feature channels. The mechanism introduces very little computation cost and memory footprint but brings an observable performance gain. Our method also adopts the channel attention mechanism but extends it to the pyramid guided channel attention mechanism, which better utilizes the semantic abstract features to guide the selection of the shallow detailed features for salient object detection.
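The squeeze, transform and reweight steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the weight names `w1`, `b1`, `w2`, `b2`, the ReLU between the two fully connected layers, and the sigmoid gate are assumptions in the spirit of the design of Hu et al. (2018).

```python
import numpy as np


def channel_attention(feature, w1, b1, w2, b2):
    """Reweight the channels of a (C, H, W) feature map.

    w1: (C_mid, C), b1: (C_mid,), w2: (C, C_mid), b2: (C,) are the weights
    of the two fully connected layers (hypothetical names).
    """
    # Squeeze: global average pooling reduces each channel to one value (1 x 1).
    squeezed = feature.mean(axis=(1, 2))                    # shape (C,)
    # Two fully connected layers; ReLU and sigmoid are assumed activations.
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)            # shape (C_mid,)
    attention = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # shape (C,), in (0, 1)
    # Highlight or suppress each channel with its attention weight.
    return feature * attention[:, None, None], attention


rng = np.random.default_rng(0)
C, C_mid, H, W = 8, 2, 5, 5
feature = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C_mid, C)), np.zeros(C_mid)
w2, b2 = rng.standard_normal((C, C_mid)), np.zeros(C)
reweighted, attention = channel_attention(feature, w1, b1, w2, b2)
```

Because the attention vector has one entry per channel, the extra cost is dominated by the two small matrix products, which is why the mechanism is cheap.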

METHOD
We show the architecture of the presented PGCA-Net framework for saliency detection in Figure 1. To begin with, we extract multi-level hierarchical features using the FCN. Feature maps from different levels have different resolutions and numbers of channels: the shallow detailed feature maps have a high resolution and few channels, whereas the semantic coarse feature maps have a low resolution and many channels. We then develop pyramid guided channel attention fusion blocks (PGCAFBs) to integrate the feature maps from the highest level to the lowest level. A PGCAFB takes a shallow detailed feature and a semantic coarse feature as input. The semantic coarse feature provides guidance for automatically selecting the beneficial channels of the shallow detailed feature, and the selected channels are then used to refine the semantic coarse feature in a residual manner. We impose the supervision signal after each PGCAFB refinement to speed up convergence and to learn more powerful saliency features. The final integrated features are used to generate the saliency maps as the output of our model. In the following sections, we elaborate on how to form the proposed pyramid guided channel attention fusion block (PGCAFB) and how to use the PGCAFBs to build our attention-guided feature aggregation network, PGCA-Net, for accurate saliency detection.

The Pyramid Guided Channel Attention Fusion Block (PGCAFB)
Because neurons in the shallow layers of the CNN have small receptive fields and weak discriminative ability, the shallow detailed features tend to show strong activations on non-salient regions by mistake, although they contain abundant detail information. Directly using such noisy shallow detailed features to introduce details leads to sub-optimal saliency detection results. Since not all channels of the shallow detailed feature maps are helpful, it is necessary to identify the beneficial, relatively robust channels among all channels of the shallow detailed features before learning the complementary features from the semantic coarse features and the shallow detailed features. To achieve this, we use the channel attention mechanism as the basis for selectively utilizing the shallow detailed features. The channel attention mechanism learns the dependencies between channels, attends to the useful channels and neglects the noisy ones. Note that the semantic coarse features have strong discriminative ability, because they come from deeper layers. We therefore use the semantic coarse features as guidance for selecting the shallow detailed features, which gives more robust results: building on the basic channel attention mechanism, we take both the shallow detailed features and the semantic coarse features as input and apply channel attention to predict the attention that is imposed on the shallow detailed features. The original channel attention is based on spatially aggregated features at only a single global scale (the input feature maps are first squeezed to 1 × 1 feature maps by global average pooling). We argue that taking more global scales into account further improves the detection accuracy at only a small extra computation cost, because information at different degrees of globality is introduced.
In this way, the original channel attention is extended to a pyramid version with multiple branches. In each branch, the input feature maps are globally aggregated in the spatial dimensions to a different spatial size (we empirically set the spatial sizes to 1 × 1, 2 × 2, 3 × 3 and 6 × 6).
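The multi-branch spatial aggregation can be illustrated with adaptive average pooling at the four spatial sizes. The sketch below is an assumption-laden example: the helper names are hypothetical, the bin boundaries follow the usual adaptive-pooling convention, and concatenating the branches into one descriptor per channel is only one plausible way to combine them, which the text does not specify.

```python
import math

import numpy as np


def adaptive_avg_pool(feature, size):
    """Average-pool a (C, H, W) map to (C, size, size) with near-equal bins."""
    channels, height, width = feature.shape
    pooled = np.empty((channels, size, size), dtype=np.float64)
    for i in range(size):
        r0, r1 = (i * height) // size, math.ceil((i + 1) * height / size)
        for j in range(size):
            c0, c1 = (j * width) // size, math.ceil((j + 1) * width / size)
            pooled[:, i, j] = feature[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return pooled


def pyramid_descriptor(feature, sizes=(1, 2, 3, 6)):
    """Flatten and concatenate the pooled maps of every pyramid branch."""
    channels = feature.shape[0]
    branches = [adaptive_avg_pool(feature, s).reshape(channels, -1) for s in sizes]
    return np.concatenate(branches, axis=1)  # (C, 1 + 4 + 9 + 36) = (C, 50)


feature = np.arange(4 * 12 * 12, dtype=np.float64).reshape(4, 12, 12)
descriptor = pyramid_descriptor(feature)
```

The first column of the descriptor is the ordinary global average used by the original channel attention; the remaining columns add coarser-to-finer regional averages.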
Based on the discussion above, we develop a pyramid guided channel attention fusion block (PGCAFB) to better learn the complementary information of the shallow detailed features and the semantic coarse features. See Figure 2 for details. The PGCAFB first uses the semantic coarse feature as guidance to attend to the useful channels of the shallow detailed feature and suppress the harmful ones through the pyramid channel attention mechanism. It then takes the semantic coarse features and the attended shallow detailed features as input to learn a residual, which is added to the semantic coarse features.
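The reweight-then-residual step can be sketched as follows, taking the attention vector as given. Everything specific here is an assumption made for illustration: the function names, nearest-neighbour upsampling of the semantic feature, and a single 1 × 1 convolution standing in for the residual branch. The actual layer layout is the one in Figure 2.

```python
import numpy as np


def upsample_nearest(feature, factor):
    """Upsample a (C, H, W) map by an integer factor (nearest neighbour)."""
    return feature.repeat(factor, axis=1).repeat(factor, axis=2)


def conv1x1(feature, weight):
    """1 x 1 convolution: mixes channels per pixel. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, feature)


def fuse(semantic, shallow, attention, w_residual):
    """Refine the semantic feature with attention-selected shallow channels.

    semantic: (C_s, h, w); shallow: (C_d, k*h, k*w); attention: (C_d,)
    channel weights; w_residual: (C_s, C_s + C_d), a stand-in residual branch.
    """
    factor = shallow.shape[1] // semantic.shape[1]
    semantic_up = upsample_nearest(semantic, factor)    # match resolutions
    attended = shallow * attention[:, None, None]       # channel selection
    stacked = np.concatenate([semantic_up, attended], axis=0)
    residual = conv1x1(stacked, w_residual)             # learned correction
    return semantic_up + residual                       # residual refinement


rng = np.random.default_rng(0)
semantic = rng.standard_normal((4, 3, 3))
shallow = rng.standard_normal((2, 6, 6))
attention = np.array([0.9, 0.1])
w_residual = 0.1 * rng.standard_normal((4, 6))
refined = fuse(semantic, shallow, attention, w_residual)
```

With the residual form, a block whose residual branch outputs zeros simply passes the upsampled semantic feature through, so each step can only add corrections on top of the coarse prediction.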

The Attention-guided Feature Aggregation Network (PGCA-Net)
To fully exploit the multi-level features and generate satisfactory saliency detection results, we develop an attention-guided feature aggregation network that progressively aggregates the features of each level using the proposed PGCAFB. As shown in Figure 1, given an input image, we first use a ResNeXt50-32x4d network (Xie et al., 2017) to extract hierarchical multi-level features, which are complementary. The lower-level features have a higher resolution and contain more saliency detail information, but their semantic discriminative ability is weaker because they come from shallower layers. The higher-level features have a lower resolution and lack spatial details because of the down-sampling operations during the forward pass, but they are more discriminative for roughly locating the salient objects. To achieve more precise saliency detection, we need to integrate the detail information of the shallow detailed features into the semantic coarse features to locate the boundaries of the salient objects more clearly, which yields better saliency results.
We progressively aggregate the features from the highest level to the lowest level using the presented PGCAFB. To balance the performance gain against the computation cost, we omit the features of the first level. The top-down feature aggregation is therefore completed in 3 steps. The PGCAFBs used in the different aggregation steps do not share parameters, which gives the network a stronger learning ability.
To further strengthen the gradients for optimization and to learn more robust saliency features, we apply the deep supervision mechanism, which imposes a supervision signal on each aggregation step during training. Because each step receives the supervision signal directly, every PGCAFB is encouraged to aggregate all the useful features needed for accurate detection. During testing, we take only the saliency maps predicted from the final aggregated features as the output.
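Deep supervision amounts to summing one loss term per aggregation step during training and keeping only the last prediction at test time. The sketch below assumes a binary cross-entropy loss and equal weights for all steps; neither is stated in the text, and the function names are hypothetical. It also assumes every step's prediction has already been resized to the ground-truth resolution.

```python
import numpy as np


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted map and its ground truth."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    loss = target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)
    return float(-np.mean(loss))


def deep_supervision_loss(step_predictions, target):
    """Sum the loss of the saliency map predicted at every aggregation step."""
    return sum(binary_cross_entropy(p, target) for p in step_predictions)


target = np.array([[1.0, 0.0], [0.0, 1.0]])
steps = [
    np.full((2, 2), 0.5),                 # step 1: coarse, uncertain
    np.array([[0.7, 0.3], [0.3, 0.7]]),   # step 2: sharper
    np.array([[0.9, 0.1], [0.1, 0.9]]),   # step 3: final aggregated features
]
total_loss = deep_supervision_loss(steps, target)  # used only during training
final_map = steps[-1]                              # the only output at test time
```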

EXPERIMENTS
We first describe the datasets and the evaluation metrics, then introduce the training and testing strategies of the presented framework, and lastly report the numerical results.

Datasets
We adopt the commonly used MSRA10K dataset as our training set, which contains 10,000 image pairs. To fully evaluate the performance of the saliency methods, we use five widely used saliency benchmark datasets for testing: ECSSD (Yan et al., 2013) with 1,000 image pairs, HKU-IS with 4,447 image pairs, PASCAL-S (Zhang et al., 2017) with 850 image pairs, SOD (Hou et al., 2017) with 300 image pairs, and DUT-OMRON (Yang et al., 2013) with 5,168 image pairs.

Evaluation Metrics
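We adopt two widely used saliency evaluation metrics, the mean absolute error (MAE) and the F-measure ($F_\beta$), in our experiments. The MAE directly compares the predicted results with the corresponding ground truth using the absolute distance. $F_\beta$ has been shown to be consistent with human visual perception (Yan et al., 2013). The MAE and $F_\beta$ are formulated as follows:

$$\mathrm{MAE} = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} \left| S(x, y) - G(x, y) \right| \qquad (1)$$

$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} \qquad (2)$$

where $S$ and $G$ denote the predicted saliency map and the ground-truth saliency map, $H$ and $W$ denote the height and width of the saliency map, and $\beta^2$ is set to 0.3 to emphasize precision.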
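The two metrics can be computed as in the following sketch. The function names are hypothetical, and binarizing the predicted map at a fixed threshold of 0.5 before computing precision and recall is an assumption made for illustration; the text does not state which thresholding scheme is used.

```python
import numpy as np


def mean_absolute_error(saliency, ground_truth):
    """Average absolute difference over all H x W pixels."""
    return float(np.mean(np.abs(saliency - ground_truth)))


def f_measure(saliency, ground_truth, threshold=0.5, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall with beta^2 = 0.3."""
    predicted = saliency >= threshold   # assumed fixed-threshold binarization
    positive = ground_truth > 0.5
    true_positive = np.logical_and(predicted, positive).sum()
    precision = true_positive / max(predicted.sum(), 1)
    recall = true_positive / max(positive.sum(), 1)
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return float((1.0 + beta_sq) * precision * recall / denominator)


S = np.array([[0.9, 0.6], [0.2, 0.1]])   # toy predicted saliency map
G = np.array([[1.0, 0.0], [1.0, 0.0]])   # toy ground truth
mae = mean_absolute_error(S, G)          # (0.1 + 0.6 + 0.8 + 0.1) / 4 = 0.4
f_beta = f_measure(S, G)                 # precision = recall = 0.5, so 0.5
```

A value of $\beta^2$ below 1 makes the score more sensitive to precision than to recall, which is the stated reason for choosing 0.3.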

The Training Parameters
We implement the presented method in the PyTorch framework. To speed up convergence and reduce the risk of over-fitting, we initialize the feature extraction network of our framework with a ResNeXt50-32x4d pre-trained on the ImageNet (Deng et al., 2009) classification task, and initialize the other parameters using the MSRA initializer. Our method is trained on the commonly used MSRA10K dataset. Stochastic gradient descent (SGD) is used to optimize the network for 10,000 iterations. At each training step, we bilinearly rescale each input image to 416 × 416 and apply random horizontal flipping before feeding it to the network. During training, we decay the learning rate following Liu et al. (2015), with an initial learning rate of 0.001 and a power of 0.9. Training the proposed model takes approximately one and a half hours on one GTX 1080Ti GPU.
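A schedule parameterized by an initial learning rate and a power is commonly the polynomial ("poly") decay. Assuming that form, which the text does not name explicitly, the learning rate at each iteration can be computed as in the sketch below, using the values reported above (0.001, 0.9 and 10,000 iterations). The function name is hypothetical.

```python
def poly_learning_rate(iteration, base_lr=0.001, max_iterations=10_000, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Learning rate at the start, the midpoint and the end of training.
schedule = [poly_learning_rate(i) for i in (0, 5_000, 10_000)]
```

The rate starts at the initial value and decays smoothly to zero at the last iteration; a power below 1 keeps it relatively high for most of the run.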

Testing
During the testing stage, our method generates a saliency map for each input image. To further improve the spatial coherence of the predicted saliency maps and obtain more satisfactory results, we apply the fully connected conditional random field (CRF) (Krähenbühl et al., 2011) to the detection results.

The Ablation Study
To verify the effectiveness of the method, we conduct an ablation study on the major components of PGCA-Net. The numerical results are given in Table 1. The "w/o PGCAFB" entry in the first row is the performance of a variant of PGCA-Net in which all PGCAFBs are replaced with a simple block that fuses the semantic coarse and shallow detailed features without attentional selection of the shallow detailed features beforehand. We also compare our method with "w PGCAFB_1", whose PGCAFBs aggregate the input feature maps to only one spatial size of 1 × 1 (the original channel attention design), to show the benefit of integrating information from more global scales when computing the channel attention (pyramid channel attention).
From the results in Table 1 and the visual comparisons in the corresponding figure, we find that: (1) introducing the channel attention mechanism helps the multi-level feature integration because of the gap in discriminative ability between features at different levels, as shown by comparing "w PGCAFB_1" with "w/o PGCAFB", and (2) considering multiple global scales when computing the channel attention (pyramid channel attention) further improves the detection accuracy.

The Comparison with the State-of-the-arts
Furthermore, we compare the presented PGCA-Net with 16 state-of-the-art saliency detectors; the results are shown in Table 2. The methods in the first three rows, including "wCtr" (Zhu et al., 2014) and "BSCA" (Qin et al., 2015), are traditional saliency detection methods based on hand-crafted features or priors. The other methods are based on deep convolutional neural networks. For a fair comparison, we obtained the saliency detection results of the other methods either by downloading the results provided by the authors or by retraining their models on the same training set with the hyper-parameters reported in their papers.
Our method is trained on the MSRA10K dataset and tested on the other five datasets. As Table 2 shows, PGCA-Net outperforms all its competitors on all five benchmark datasets in terms of both the F-measure and MAE metrics, which demonstrates the strong generalization ability of PGCA-Net.

CONCLUSION
To conclude, we put forward PGCA-Net with the novel pyramid guided channel attention fusion block (PGCAFB) to robustly and accurately segment the salient objects in an input image. The high-level features are used as guidance to dynamically select the lower-level features, which contain abundant saliency details. We adopt the channel attention mechanism for the feature selection process in order to obtain a performance gain at only a small extra computation cost. To exploit the multi-scale spatial context in the features and further boost the performance, we extend the original channel attention to the pyramid channel attention with a negligible number of extra parameters and little extra computation. The presented network achieves state-of-the-art performance on salient object detection.

DISCLOSURE STATEMENT
No potential conflict of interest was reported by the authors.