HDAM: Heuristic Difference Attention Module for Convolutional Neural Networks

The attention mechanism is one of the most important forms of prior knowledge for enhancing convolutional neural networks. Most attention mechanisms are attached to the convolutional layer and use local or global contextual information to recalibrate the input, which is a popular design strategy. Global contextual information helps the network consider the overall distribution, while local contextual information is more general. In both cases, the contextual information makes the network attend to the mean or maximum value of a particular receptive field. Unlike most attention mechanisms, this article proposes a novel one, the heuristic difference attention module (HDAM). HDAM recalibrates the input based on the difference between the local and global contextual information instead of the mean and maximum values. At the same time, to give different layers a more suitable local receptive field size and to make the design of the local receptive field more flexible, we use a genetic algorithm to heuristically produce local receptive fields. First, HDAM extracts the mean values of the global and local receptive fields as the corresponding contextual information. Then the difference between the global and local contextual information is calculated. Finally, HDAM uses this difference to recalibrate the input. In addition, we use the heuristic ability of the genetic algorithm to search for the local receptive field size of each layer. Our experiments on CIFAR-10 and CIFAR-100 show that HDAM achieves higher accuracy with fewer parameters than other attention mechanisms. We implement HDAM with the Python library PyTorch, and the code and models will be publicly available.


Introduction
Convolutional Neural Networks (CNNs) [1] have developed remarkably over the past 10 years. Owing to their efficient representations, CNNs have achieved remarkable results in multiple downstream tasks, such as classification [2], detection [3] and segmentation [4]. Efforts to improve their representation capability have therefore never stopped. For example, in the early days of CNNs, researchers found that the depth [5] of the network has a great impact on performance, because the deeper the network, the richer its high-dimensional information. However, once the network reaches a certain depth, not only do the computational cost and the inference time increase, but performance also degrades severely because of vanishing or exploding gradients [6]. Besides depth, the width of the network is another important factor. Like increasing the depth, increasing the width [7,8] improves the representation ability of the network, but it also increases the computational cost and the inference time. Cardinality [9] diversifies the convolutions within the same layer, which can significantly improve the network with no or only a few additional parameters. All of the above improve network performance by stacking. The skip connection [10] instead changes how information is transmitted: it requires no additional parameters, addresses exploding and vanishing gradients, and speeds up convergence. However, additional storage is needed to hold the skip-connected part during inference. Although the above methods improve the performance of CNNs to a certain extent, they consume considerable computing resources such as memory and floating-point operations. The attention mechanism can improve CNNs with a small number of parameters and zero additional storage requirements. More importantly, the attention mechanism simulates the human visual system, that is, it pays more attention to meaningful information than to meaningless information.
Among attention networks, SENet [11] is one of the most representative. SENet [11] proposes a channel attention mechanism that calculates the global average of each channel and uses it as contextual information (CI) to recalibrate the input. This magnifies the global feature of each channel. On this basis, CBAM [12] and BAM [13] additionally consider spatial attention and add the global maximum value to enrich the global CI, so that the network can find what and where to focus on more accurately. To make this attention more general, GENet [14] extracts local CI instead of global CI. Building on GENet [14], SPANet [15] extracts local CI at different spatial scales.
These attention strategies use few parameters to enhance the performance of CNNs. They recalibrate the input pixels by multiplying them with the global or local CI embedding, thereby integrating the CI into the network's information flow. Most attention mechanisms thus use either global CI or local CI. Reviewing the meaning of the two, we conclude that global CI is the average value over the entire image and reflects the overall trend of the pixel values, whereas local CI is the average value over a local receptive field and represents the pixel values in a small area of the sample. The two differ, and the animal visual system pays attention to this difference: the difference in color distribution between objects is a prerequisite for an observer to distinguish them and pay special attention. Today's attention strategies do not take this into consideration. Therefore, this paper proposes a novel attention strategy based on the difference between global and local CI. At the same time, to design a more reasonable local receptive field, we adopt a heuristic strategy: we introduce a genetic algorithm (GA) [16] to search heuristically for the size of the local receptive field.
Specifically, we first extract global and local CI, obtain their embeddings through a shared two-layer multi-layer perceptron (MLP) [17], calculate the difference between global and local CI from the embeddings, and use this difference to recalibrate the input. At the same time, we encode the combination of local receptive field sizes of all layers in the network and search for the best combination with GA [16]. We validate HDAM on CIFAR-10 [18] and CIFAR-100 [18], using accuracy and the number of parameters as the evaluation metrics. The results show that HDAM surpasses a range of current state-of-the-art network models, including classic networks, attention networks, and networks based on neural architecture search (NAS) [19,20].

Related Work
In this part, we introduce the work related to HDAM from two aspects: convolutional neural networks and attention mechanisms.

Convolutional neural network
In the first decade of the 21st century, limited by hardware, the development of CNNs was at a low ebb. With the gradual increase in computing power, and following the success of AlexNet [23] in 2012, the development of CNNs took off. Since then, CNNs have been the main backbone of computer vision and have made remarkable achievements. After AlexNet, researchers continued to improve the performance of CNNs. GoogLeNet [8] and VGG [7] increased the depth of CNNs and found that depth is an important factor affecting performance. However, training such models requires careful design, such as the initialization and learning rate settings; otherwise it is difficult to achieve the desired performance. Batch normalization (BN) [21] attributes this to the fact that the convolutional layers fit inputs whose distribution changes from one forward pass to the next, that is, the input undergoes an internal covariate shift, and it therefore proposes to normalize the data of each batch. This makes training easier and yields compelling performance. Although BN makes training easier, the exploding and vanishing gradients caused by increasing depth still limit the potential of CNNs. The skip connection proposed by ResNet [10] largely solves this problem, because it alleviates the gradient accumulation caused by the chain rule. ResNet [10] provides an efficient network topology template for later CNN designs. In addition to depth, WideResNet [5], which builds on ResNet [10], shows that expanding the width is also an effective way to improve CNNs. Depth and width are thus important hyperparameters that affect the performance of CNNs. Besides, the convolution operation itself affects performance from another perspective. The depthwise separable convolution [24] uses fewer parameters to achieve accuracy similar to that of the standard convolution and is mainly used on mobile devices. ResNeXt [9] uses multiple convolutions of different sizes in the same convolutional layer and greatly improves the performance of CNNs with no or only a few additional parameters. Different from the above methods, current CNN design mainly focuses on performance improvement strategies based on the attention mechanism. This paper also proposes a new type of attention network.

Attention mechanism
The attention mechanism simulates how the animal visual system works, that is, by paying attention to the more informative parts. It can improve model performance with no or only a few additional parameters. The attention mechanism mainly extracts the CI of the feature maps and then multiplies the CI back into the network to increase the network's sensitivity to this information. SENet [11] is a typical attention network; it uses the result of global average pooling as CI. SPANet [15] and GENet [14] extract the local mean as local CI, which generalizes the approach based on global CI. In addition to the mean value, CBAM [12] and BAM [13] also use the maximum value as a component of the CI. Different from all existing attention mechanisms, we extract global and local CI at the same time, compute the difference between the two, and pass this difference back to the network. At the same time, to find the most suitable local receptive field size, we use GA [16], a heuristic search method, for the first time in the field of attention mechanisms to generate the most suitable combination of local receptive field sizes.
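For contrast with a difference-based scheme, the following is a minimal pure-Python sketch of SE-style recalibration with global CI: global average pooling per channel, a two-layer gate (ReLU, then sigmoid), and channel-wise scaling. The function name `se_recalibrate` and the weights `w1` and `w2` are illustrative assumptions; a real SE block learns its weights and runs on tensors.

```python
import math


def se_recalibrate(x, w1, w2):
    """Scale each channel of x (C x H x W nested lists) by an SE-style gate."""
    # Squeeze: global average of every channel, used as the global CI.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: hidden layer with ReLU, output layer with sigmoid.
    h = [max(0.0, sum(w * a for w, a in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, h)))) for row in w2]
    # Recalibration: multiply every pixel of channel c by its gate s[c].
    return [[[v * s[c] for v in row] for row in ch] for c, ch in enumerate(x)]


# Two channels of size 2 x 2; all-zero weights (hypothetical) give a gate of 0.5.
x = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]
out = se_recalibrate(x, w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```

With the zero weights chosen here, every gate is sigmoid(0) = 0.5, so each channel is simply halved; learned weights would produce a different gate per channel.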

Proposed Algorithm
In this part, we discuss HDAM in detail. HDAM mainly includes four parts, namely global and local CI extraction, embedding and difference calculation, input recalibration, and the search for the best local receptive field. To explain HDAM more precisely, we provide the detailed formulas.

Contextual information extraction
CI extraction is an important operation of the attention mechanism. CI summarizes the information of a specific receptive field and is the basis for the embedding calculation.
We use the mean to represent the CI of a receptive field. First, we divide the input into non-overlapping patches, and each patch is a receptive field. We then calculate the average value of the receptive field on each channel and use it as the CI of that receptive field on that channel. Given $\mathrm{Input} \in \mathbb{R}^{C \times H \times W}$, we first reshape it into $\mathbb{R}^{P \times C \times \tilde{H} \times \tilde{W}}$, where $P$ is the number of local receptive fields (patches) and $\tilde{H}$ and $\tilde{W}$ denote the height and width of a patch, so that $P = HW/(\tilde{H}\tilde{W})$. With the addition of the global receptive field, the final receptive field matrix $RF$ is in $\mathbb{R}^{(P+1) \times C \times \tilde{H} \times \tilde{W}}$, and the CI is computed as

$$CI = \mathrm{Mean}(RF),$$

where $\mathrm{Mean}(\cdot)$ calculates the mean of each receptive field in $RF$ on each channel and $CI \in \mathbb{R}^{(P+1) \times C}$. If the receptive field is the global one, the result is the global CI; otherwise it is a local CI.
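The extraction step can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the authors' PyTorch implementation: the function name `extract_ci`, the square `patch` side length, and the nested-list layout are all hypothetical choices made for readability.

```python
def extract_ci(x, patch):
    """Return (local_ci, global_ci) for x given as C x H x W nested lists.

    local_ci has one row per non-overlapping patch (row-major patch order),
    each row holding the per-channel mean of that patch; global_ci holds the
    per-channel mean over the whole input.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    if H % patch or W % patch:
        raise ValueError("patch size must divide H and W")
    local_ci = []
    for top in range(0, H, patch):
        for left in range(0, W, patch):
            row = []
            for c in range(C):
                vals = [x[c][i][j]
                        for i in range(top, top + patch)
                        for j in range(left, left + patch)]
                row.append(sum(vals) / len(vals))
            local_ci.append(row)
    global_ci = [sum(sum(r) for r in x[c]) / (H * W) for c in range(C)]
    return local_ci, global_ci


# One channel, 4 x 4 input, 2 x 2 patches, so P = 16 / 4 = 4 local receptive fields.
x = [[[1, 1, 2, 2],
      [1, 1, 2, 2],
      [3, 3, 4, 4],
      [3, 3, 4, 4]]]
local_ci, global_ci = extract_ci(x, patch=2)
```

In this toy input each patch is constant, so the four local CIs are 1, 2, 3 and 4, while the global CI is their average, 2.5.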

Embedding and difference calculation
Embedding calculation maps the extracted CI. To control the number of parameters, we use a shared two-layer MLP to map the extracted global and local CI. The ReLU [22] activation function is used after the first layer, and the softmax activation function is used after the second layer. Finally, we use cross entropy to calculate the difference between the global embedding and the local embedding. With the bias omitted for clarity, the embedding is

$$\mathrm{Embedding} = \mathrm{Softmax}(\mathrm{ReLU}(CI\,W_1)\,W_2),$$

where $\mathrm{Embedding} \in \mathbb{R}^{(P+1) \times C}$ and $W_1$ and $W_2$ denote the weights of the two-layer MLP. Embedding stacks the local embedding (LE) and the global embedding (GE),

$$\mathrm{Embedding} = \begin{bmatrix} LE \\ GE \end{bmatrix},$$

where $LE \in \mathbb{R}^{HW/(\tilde{H}\tilde{W}) \times C}$ and $GE \in \mathbb{R}^{1 \times C}$. The difference coefficient (DC) is the cross entropy between GE and each row of LE, with $DC \in \mathbb{R}^{HW/(\tilde{H}\tilde{W}) \times 1}$.
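A minimal sketch of the embedding and difference step is given below. Several details are assumptions rather than facts from the text: the direction of the cross entropy (here the global embedding is the reference distribution), the small `eps` added for numerical safety, the helper names, and the identity weights used in the example.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def linear(w, v):
    """Multiply weight matrix w (list of rows) with vector v; bias is omitted."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def embed(ci, w1, w2):
    """Shared two-layer MLP: ReLU after the first layer, softmax after the second."""
    hidden = [max(0.0, a) for a in linear(w1, ci)]
    return softmax(linear(w2, hidden))


def cross_entropy(p, q, eps=1e-12):
    # Assumed direction: H(p, q) = -sum_c p_c * log(q_c).
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def difference_coefficients(local_ci, global_ci, w1, w2):
    """One coefficient per patch: cross entropy between GE and that patch's LE row."""
    ge = embed(global_ci, w1, w2)
    return [cross_entropy(ge, embed(ci, w1, w2)) for ci in local_ci]


# Two channels, two patches; identity weights are purely illustrative.
identity = [[1.0, 0.0], [0.0, 1.0]]
dc = difference_coefficients(
    local_ci=[[2.0, 0.5], [0.0, 3.0]],
    global_ci=[2.0, 0.5],
    w1=identity,
    w2=identity,
)
```

Under this assumed direction, a patch whose CI matches the global CI receives the smallest coefficient (the entropy of GE), and patches that deviate more from the global statistics receive larger ones.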

Recalibration
Recalibration multiplies the difference coefficient with the input. This lets the difference between the global and local CI flow into the network during the forward pass to enrich subsequent feature processing, so that the gradient carries the difference information when optimizing the network parameters. We broadcast each DC into a matrix with the same shape as its corresponding local receptive field and multiply the resulting matrix with the Input. Finally, we reshape the Output into $\mathbb{R}^{C \times H \times W}$.
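The broadcast-and-multiply step can be illustrated with a short pure-Python sketch. The function name `recalibrate`, the square `patch` side length, and the row-major ordering of the coefficients are assumptions made for this example.

```python
def recalibrate(x, dc, patch):
    """Scale every pixel of x (C x H x W nested lists) by the DC of its patch.

    dc lists one coefficient per patch in row-major patch order; the same
    coefficient is applied to all channels of that patch.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    patches_per_row = W // patch
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                p = (i // patch) * patches_per_row + (j // patch)
                out[c][i][j] = x[c][i][j] * dc[p]
    return out


x = [[[1, 1, 2, 2],
      [1, 1, 2, 2],
      [3, 3, 4, 4],
      [3, 3, 4, 4]]]
out = recalibrate(x, dc=[0.5, 1.0, 2.0, 0.0], patch=2)
```

Because the output is written directly at the original pixel positions, it already has the shape C × H × W and no separate reshape is needed in this sketch.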

Local receptive field search
In this part, we elaborate on how GA's heuristic search is used to design the local receptive fields. For ease of explanation, we use ResNet-50 [10] as the base model. ResNet-50 [10] consists of four units, each made up of several residual blocks; the numbers of residual blocks in the four units are three, four, six, and three, respectively. We apply HDAM only to the input of each residual block. The input spatial sizes of the residual blocks in the four units are 16, 16, 8, and 4, respectively. Taking the first unit as an example, because the input spatial size of each residual block is 16, the local receptive field size of each block can be chosen from [1/16, 1/8, 1/4, 1/2, 0], where each number is a proportion of the input spatial size and 0 means that HDAM is not used. The ranges for the remaining units are [1/16, 1/8, 1/4, 1/2, 0], [1/8, 1/4, 1/2, 0] and [1/4, 1/2, 0].
We use an array of length 16, that is, the total number of blocks in all units, to represent the combination of local receptive field sizes of all residual blocks in ResNet-50 [10]. Figure 1 shows an example. We treat each combination of local receptive fields as an individual and use GA [16] to search for the best combination. Algorithm 1 shows the entire GA [16] process.
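The encoding can be sketched as follows. The block counts, input sizes and option lists come from the text; the names (`gene_options`, `random_individual`, `decode`, `search_space_size`) and the use of `None` for "HDAM not used" are illustrative assumptions.

```python
import random
from fractions import Fraction as F

BLOCKS_PER_UNIT = [3, 4, 6, 3]      # residual blocks in the four ResNet-50 units
INPUT_SIZES = [16, 16, 8, 4]        # input spatial size of the blocks in each unit
OPTIONS_PER_UNIT = [
    [F(1, 16), F(1, 8), F(1, 4), F(1, 2), 0],
    [F(1, 16), F(1, 8), F(1, 4), F(1, 2), 0],
    [F(1, 8), F(1, 4), F(1, 2), 0],
    [F(1, 4), F(1, 2), 0],
]


def gene_options():
    """Allowed values for each of the 16 genes, one gene per residual block."""
    return [opts for opts, n in zip(OPTIONS_PER_UNIT, BLOCKS_PER_UNIT) for _ in range(n)]


def random_individual(rng):
    return [rng.choice(opts) for opts in gene_options()]


def decode(individual):
    """Patch side length in pixels for each block; None means HDAM is not used."""
    sizes = [s for s, n in zip(INPUT_SIZES, BLOCKS_PER_UNIT) for _ in range(n)]
    return [int(g * s) if g else None for g, s in zip(individual, sizes)]


def search_space_size():
    total = 1
    for opts in gene_options():
        total *= len(opts)
    return total


individual = random_individual(random.Random(0))
patch_sizes = decode(individual)
```

The search space contains 5^3 · 5^4 · 4^6 · 3^3 = 8,640,000,000 combinations, which is why exhaustive evaluation is impractical and a heuristic search is attractive.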

Dataset
We conduct our experiments on two of the most popular datasets, CIFAR-10 [18] and CIFAR-100 [18]. The CIFAR [18] dataset was collected by Alex Krizhevsky, Vinod Nair and Geoffrey Hinton and is divided into two subsets, CIFAR-10 [18] and CIFAR-100 [18], according to the number of categories. Each subset contains 60,000 images with a size of 32 × 32, including 50,000 training images and 10,000 test images. The difference is that CIFAR-10 [18] contains 10 categories with 6,000 images each, of which 5,000 are used for training and 1,000 for testing, whereas CIFAR-100 [18] contains 100 categories with 600 images each, of which 500 are used for training and 100 for testing.

Algorithm 1 Local receptive field search
Require: The population size N, the maximal generation number T, the crossover probability µ, the mutation probability ν.
P_0 ← Initialize N arrays as a population using the encoding strategy;
Decode each individual and generate the corresponding CNN (ResNet-50 [10]). Train and validate each CNN, then take its highest accuracy as the fitness of the individual in P_0;
t ← 0;
while t < T do
    Q_t ← ∅;
    while |Q_t| < N do
        p_1, p_2 ← Select two arrays from P_t with binary tournament selection;
        q_1, q_2 ← Generate two arrays from p_1 and p_2 by the crossover operation with probability µ and the mutation operation with probability ν;
        Q_t ← Q_t ∪ q_1 ∪ q_2;
    end while
    Train and evaluate the CNNs in Q_t;
    P_{t+1} ← Select N arrays from P_t ∪ Q_t by environmental selection;
    t ← t + 1;
end while
Ensure: The architecture of a ResNet-50 [10] with the best array.
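The loop of Algorithm 1 can be sketched in runnable form as below. The sketch makes several assumptions that the text does not specify: one-point crossover, a mutation that resamples a single gene, and an environmental selection that keeps the N fittest of parents and offspring. The fitness function `toy_fitness` is only a placeholder for training and validating the decoded CNN.

```python
import random


def binary_tournament(pop, fit, rng):
    """Pick two distinct individuals at random and return the fitter one."""
    a, b = rng.sample(range(len(pop)), 2)
    return pop[a] if fit[a] >= fit[b] else pop[b]


def one_point_crossover(p1, p2, rng):
    """Swap the tails of two parents at a random cut point (assumed operator)."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def mutate(ind, options, rng):
    """Resample one randomly chosen gene from its allowed options (assumed operator)."""
    child = list(ind)
    i = rng.randrange(len(child))
    child[i] = rng.choice(options[i])
    return child


def genetic_search(options, evaluate, n=20, generations=20, mu=0.9, nu=0.2, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(o) for o in options] for _ in range(n)]
    fit = [evaluate(ind) for ind in pop]
    for _ in range(generations):
        offspring = []
        while len(offspring) < n:
            p1 = binary_tournament(pop, fit, rng)
            p2 = binary_tournament(pop, fit, rng)
            if rng.random() < mu:
                q1, q2 = one_point_crossover(p1, p2, rng)
            else:
                q1, q2 = list(p1), list(p2)
            if rng.random() < nu:
                q1 = mutate(q1, options, rng)
            if rng.random() < nu:
                q2 = mutate(q2, options, rng)
            offspring += [q1, q2]
        off_fit = [evaluate(ind) for ind in offspring]
        # Environmental selection, assumed here to keep the N fittest overall.
        ranked = sorted(zip(fit + off_fit, pop + offspring),
                        key=lambda t: t[0], reverse=True)[:n]
        fit = [f for f, _ in ranked]
        pop = [ind for _, ind in ranked]
    return pop[0], fit[0]


# Per-block options for the 3 + 4 + 6 + 3 residual blocks of ResNet-50.
options = [[1/16, 1/8, 1/4, 1/2, 0]] * 7 + [[1/8, 1/4, 1/2, 0]] * 6 + [[1/4, 1/2, 0]] * 3


def toy_fitness(ind):
    """Placeholder for 'train the decoded CNN and return its best accuracy'."""
    return sum(1 for g in ind if g == 1/4)


best, score = genetic_search(options, toy_fitness)
```

In a real search, each call to `evaluate` is a full training run, so the population size and generation count dominate the total cost.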

Peer Competitor
To illustrate the performance of HDAM, we select a variety of CNN models for comparison, including classic CNNs, CNNs searched by NAS, and CNNs with mainstream attention mechanisms. The CNN architectures searched by NAS include those found by semi-automatic NAS and by fully automatic NAS. The classic CNN structures include DenseNet [25], Maxout [26], VGG [7], Network in Network [27], Highway Network [28], All-CNN [29] and FractalNet [30]. The structures searched by semi-automatic NAS include Genetic CNN [31], EAS [32] and Block-QNN-S [33]. The structures found by fully automatic search include Large-scale Evolution [34], CGP-CNN [35], NAS [36], MetaQNN [37] and AE-CNN [38]. CNNs based on the attention mechanism include SE-Net [11] and CBAM [12]. For all models except the attention-based CNNs, we directly use the experimental results reported in the original papers, because these results are often the best. We retrain the attention-based CNNs.

Parameter Settings
We use ResNet-50 [10] as the base model in which to embed HDAM. Given our computing resources, two NVIDIA 2080 Ti graphics processing units (GPUs), we set the population size to 20 and the maximal number of generations to 20. The crossover and mutation probabilities are set to 0.9 and 0.2, respectively. We use SGD with momentum as the optimizer, with the momentum and weight decay set to 0.9 and 5e-4, respectively. Each individual is trained for a total of 250 epochs. The batch size is 128 and the learning rate is shown in Table 1. The training accuracy is  [39]. Random cropping pads four zero pixels on each border of the image and then randomly crops an image of size 32 × 32.
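The random cropping described above can be sketched in pure Python for a single-channel image. The function name `random_crop` and its parameters are illustrative assumptions; in practice an image library would normally provide this transform.

```python
import random


def random_crop(img, size=32, pad=4, rng=random):
    """Zero-pad img (H x W nested lists) by `pad` pixels on every border,
    then return a randomly positioned size x size crop."""
    h, w = len(img), len(img[0])
    blank = [0] * (w + 2 * pad)
    padded = ([list(blank) for _ in range(pad)]
              + [[0] * pad + list(row) + [0] * pad for row in img]
              + [list(blank) for _ in range(pad)])
    top = rng.randint(0, h + 2 * pad - size)
    left = rng.randint(0, w + 2 * pad - size)
    return [row[left:left + size] for row in padded[top:top + size]]


# A 32 x 32 image of ones; the crop keeps the size and shifts content by at most 4 pixels.
img = [[1] * 32 for _ in range(32)]
crop = random_crop(img, rng=random.Random(0))
```

Since the padded image is 40 × 40, the crop offset ranges from 0 to 8 in each direction, so at least 28 × 28 original pixels always remain visible.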

Experiment Results
To avoid accidental factors, we run each experiment 5 times and take the average of the 5 runs as the final result. In addition to accuracy, the number of parameters is used as an evaluation index. Table 2 shows the experimental results of HDAM on the two datasets. The first column of the table is the model name, the second and third columns are the accuracy of each model on CIFAR-10 [18] and CIFAR-100 [18], the fourth column is the number of parameters of each model, and the last column is the model category: hand-crafted, semi-automatic, or fully automatic. '-' means that the corresponding model has no public record.
The experimental results show that HDAM obtains the best accuracy on both datasets: 96.10 on CIFAR-10 [18] and 79.79 on CIFAR-100 [18]. On CIFAR-10 [18], the accuracy of HDAM is 1.32 higher than the highest accuracy among hand-designed classic CNNs, 0.33 higher than the highest accuracy of CNNs generated by semi-automatic NAS, 0.4 higher than the highest accuracy of CNNs generated by fully automatic NAS, and 0.35 higher than the highest accuracy of CNNs based on the attention mechanism. On CIFAR-100 [18], it is 2.09 higher than the highest accuracy of hand-designed CNNs, 0.44 higher than the highest accuracy of CNNs generated by semi-automatic NAS, 0.64 higher than the highest accuracy of CNNs generated by fully automatic NAS, and 0.53 higher than the highest accuracy of CNNs based on the attention mechanism. In terms of the number of parameters, HDAM has about half the parameters of the attention networks SE-ResNet-101 [11] and CBAM-ResNet-101 [12], which means that HDAM saves nearly half of the parameters while achieving higher accuracy.

Conclusion and Future Work
We propose HDAM, a new attention module based on heuristic search and on the difference between local and global CI. The module calculates global and local CI at the same time, but unlike previous attention mechanisms, it does not recalibrate the input with the local or global CI itself; instead, it calculates the difference between the two and recalibrates the input with that difference. In addition, to design a more reasonable local receptive field size, we are the first to introduce heuristic search into the design of attention mechanisms. We encode the local receptive field of each convolutional layer into individuals and use GA to search for the most suitable combination of local receptive fields. We use ResNet-50 as the base model in which to embed HDAM, test HDAM on CIFAR-10 and CIFAR-100, and compare it with four types of CNNs, including classic and state-of-the-art models. The results show that HDAM surpasses all of these models on CIFAR-10 and CIFAR-100. Compared with the most popular attention-based models, HDAM obtains higher accuracy with nearly half the parameters. In future work, we will use weight inheritance to reduce the time spent searching for local receptive fields.
Table 2: Comparison between the proposed HDAM and the state-of-the-art peer competitors in terms of classification accuracy and number of parameters on the CIFAR-10 [18] and CIFAR-100 [18] datasets.

Figure 1: An encoded individual. The numbers in the dashed lines indicate other size options in a residual block.