WMA: A Multi-Scale Self-Attention Feature Extraction Network Based on Weight Sharing for VQA

Visual Question Answering (VQA) has attracted extensive research focus and has become a hot topic in deep learning recently. The development of computer vision and natural language processing technology has contributed to the advancement of this research area. The key levers for improving the performance of a VQA system lie in its feature extraction, multimodal fusion, and answer prediction modules. A persistent issue in popular VQA image feature extraction modules is that they struggle to extract fine-grained features from objects of different scales. In this paper, we design a novel feature extraction network that combines multi-scale convolution and self-attention branches to solve this problem. Our approach achieves state-of-the-art single-model performance on the Pascal VOC 2012, VQA 1.0, and VQA 2.0 datasets.


Introduction
In recent years, with the development of artificial intelligence technology, methods based on deep learning have achieved great success in computer vision and natural language processing. Traditionally, these two research fields were mutually independent, but multimodal learning for language and vision has recently gained much attention in tasks such as Visual Question Answering (VQA) [1], image-text retrieval [2][3], and image captioning [4][5]. Among these multimodal machine learning tasks, VQA learns to infer the answer given a real-world image and a natural-language question about the visual content of that image. VQA is thus challenging, because it requires a simultaneous understanding of both the visual content of images and the textual content of questions.
In this paper, we argue that an effective fine-grained feature extraction module is crucial for VQA. Most existing approaches construct an effective but heavy network to extract discriminative image features; such methods based on complex DNNs cannot keep the model lightweight. The bottom-up attention mechanism [6] is the most straightforward and common solution for learning discriminative features and has been successfully applied to VQA tasks. A given question may be strongly related to only a small part of the image, so it is intuitive to introduce the bottom-up attention mechanism based on Faster R-CNN [7] into VQA to adaptively learn the image regions most relevant to the question. On the other hand, CNNs with the same structure extract the same features, so blindly stacking different CNNs makes the model more complex and redundant, while a single CNN is insensitive to objects of different scales. Multi-scale deep networks were introduced to tackle this problem; the Single Shot MultiBox Detector (SSD) [8] and Feature Pyramid Networks (FPN) [9] are the most popular examples. Motivated by these observations, we formulate feature extraction as a combination of a self-attention mechanism and a multi-scale deep network, retaining the advantages of both. To demonstrate the merits of the designed module, we conduct ablation and comparative experiments in the remainder of this article.

Visual Question Answering
Although computer vision and natural-language question answering systems have been developed for nearly half a century, the concept of a visual question answering system was formally proposed only in 2015 [1]. A visual question answering system is a multimodal system that fuses images and text. It usually includes sub-task modules such as image feature extraction, question semantic analysis, and multimodal feature fusion.
In recent years, networks using CNNs for deep and wide feature extraction have gradually been enriched alongside the development of deep learning technology [10][11]. The VGG network proposed by Simonyan et al. [12] became one of the most popular networks thanks to its deep convolutional architecture. The ResNet feature extraction network proposed by He et al. [13] solves the gradient-vanishing phenomenon caused by increasing network depth, laying the foundation for current feature extraction networks. In addition, two-stage detectors such as Faster R-CNN, designed by Ren et al. [14], extract region proposals by adding a feature-shared RPN with little increase in complexity, greatly improving object detection accuracy. The one-stage network RetinaNet, proposed by Lin et al. [15], addresses the class-imbalance difficulty of one-stage detectors through the designed Focal Loss function, making the architecture more lightweight and achieving a better balance between speed and accuracy.
Motivated by these advanced approaches, an image feature extraction module can extract sufficient feature information from the input image of a VQA system. However, since only a few important regions of the image are needed to answer a VQA question, complex object detection models may make the extracted features so redundant that resources are wasted. We therefore introduce an attention mechanism into our approach.

Attention Mechanism
In the field of natural language processing, the attention mechanism was first proposed by Bahdanau et al. [16] in 2014. It allows the model to focus training on the relevant parts of the input data rather than the irrelevant parts, improving both the speed and the quality of key-information extraction. With the continuous development of deep learning, the field of computer vision has likewise turned to neural networks with attention mechanisms [16][17][18][19][20][21][22], through which a network can pay more attention to key regions.
The visual attention mechanism is mostly implemented with masks: the neural network focuses on the most critical parts of the picture by learning the key image features marked by the mask. Jaderberg et al. [17] proposed a spatial transformer module that applies attention in the spatial domain, transforming the spatial information of the original image into another space while retaining the key information. In the channel domain, the SENet proposed by Hu et al. [18] reweights channels during convolution according to each channel's contribution to the key information.
More recently, Wang et al. [22] proposed a model combining bottom-up and top-down attention mechanisms for image captioning and visual question answering. The top-down attention mechanism learns the weights corresponding to the features (generally using LSTMs [23]) to deeply understand the visual content. In other words, bottom-up attention extracts important regions of the picture, each represented by a feature vector, while top-down attention determines how much each feature contributes to the text and then extracts the features of the salient regions that contribute most to the description. In addition, Huang et al. [24] proposed "self-attention", a non-local statistical attention mechanism that captures dependencies between long-distance features. Motivated by these works, we add this attention mechanism to our structure so that the model focuses more on the objects in the image.

Method
In this section, we first introduce the residual network module, and then describe in detail the multi-scale convolutional network based on weight sharing to which this technology is applied. We also add a visual attention mechanism to the feature extraction to obtain a new feature extraction network for VQA. Finally, we propose a WMA-based target detection model, WMA R-CNN, and introduce its construction in detail. For the image feature extraction task of VQA, we define the input as follows:

I = {i_1, i_2, . . . , i_m},    (1)

where I is a given input image and m is the number of pixels in this image. According to the input image I, the feature extraction network outputs, after convolution and pooling, the image feature result

F̂ = {f̂_1, f̂_2, . . . , f̂_n},    (2)

where n is the number of points into which the feature extraction network divides the image, and F̂ is the image feature set extracted by the network. Our ultimate target is to enable the model to deal with objects of different sizes while keeping the model lightweight and sensitive to the important objects in the image, so as to avoid wasting computing resources.

Multi-Scale Convolutional Module
When using feature extraction networks in object detection tasks, we want the network to obtain the most accurate features with the fewest resources. From the perspective of the input image, we want to reduce the impact of the image background during feature extraction and to classify objects of different sizes without causing gradient explosion. In practice, the input image suffers from factors such as uneven illumination, complex backgrounds, and heavy noise, and existing feature extraction networks cannot satisfy the agility and efficiency requirements of a visual question answering system. We therefore propose a new method for image feature extraction. This method can not only perform deeper feature extraction on the image at different data granularities, but also analyze the image in detail at the microscopic level to exclude various interference factors, so that the key information in the image can be extracted efficiently.
As shown in Fig. 1, we set up convolution kernels with receptive fields of sizes 1 × 1, 3 × 3, and 5 × 5 by analyzing the features of the input image. With three parallel branches of convolution kernels of different sizes, the network can capture the detailed characteristics of objects of different sizes. In addition, we split the 3 × 3 and 5 × 5 kernels into 1 × 3 and 3 × 1, and 1 × 5 and 5 × 1 kernels, respectively. A 1 × n kernel followed by an n × 1 kernel covers the same receptive field as an n × n kernel while using fewer parameters, so this factorization further enhances the agility of our network.
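The parameter saving from this factorization can be checked with a little arithmetic. The sketch below is a simplification: it ignores bias terms and assumes the second factor of the pair keeps the output channel count.

```python
def square_kernel_params(c_in, c_out, n):
    """Weights in a single n x n convolution (bias terms ignored)."""
    return c_in * c_out * n * n

def factorized_kernel_params(c_in, c_out, n):
    """Weights in a 1 x n convolution followed by an n x 1 convolution."""
    return c_in * c_out * n + c_out * c_out * n

# With 256 input and output channels, a 5 x 5 kernel costs 25 * 256^2 weights,
# while the 1 x 5 / 5 x 1 pair costs only 10 * 256^2.
print(square_kernel_params(256, 256, 5))      # 1638400
print(factorized_kernel_params(256, 256, 5))  # 655360
```

The saving grows with n: the square kernel scales quadratically in n while the factorized pair scales only linearly.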
We add dilated convolutions with dilation rates 1, 2, and 3 to the 1 × 1, 3 × 3, and 5 × 5 kernels, which lets the network control the range of its receptive field through different dilation rates. Assuming the current feature-map stride is s, a 3 × 3 convolutional layer with dilation rate d enlarges the receptive field by 2 · (d − 1) · s; correspondingly, n such layers with dilation rate d enlarge the receptive field by 2 · n · (d − 1) · s. Dilated convolution therefore makes the feature extraction network more sensitive to objects of different sizes and captures more comprehensive feature information.
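As a quick sanity check of this receptive-field arithmetic, the helper below (a sketch for 3 × 3 kernels; the factor 2 comes from the two extra taps a 3 × 3 kernel adds, one on each side) computes the extra receptive-field range contributed by dilation:

```python
def dilation_rf_gain(d, s, n_layers=1):
    """Extra receptive-field range (in input pixels) gained by raising the
    dilation rate of n_layers stacked 3 x 3 convolutions from 1 to d,
    at feature-map stride s: 2 * n * (d - 1) * s."""
    return 2 * n_layers * (d - 1) * s

# One 3 x 3 layer with dilation 2 at stride 1 widens the receptive field by 2;
# three such layers with dilation 3 at stride 2 widen it by 2 * 3 * 2 * 2 = 24.
print(dilation_rf_gain(2, 1))              # 2
print(dilation_rf_gain(3, 2, n_layers=3))  # 24
```

Note that dilation rate 1 (an ordinary convolution) contributes no extra range, matching the formula.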

Multi-Scale Self-Attention Convolutional Network Based on Weight-Sharing (WMA)
Our designed multi-scale lightweight feature extraction network based on weight sharing is an image feature extraction algorithm built on convolutional neural networks. It keeps the number of network parameters low by combining the weight-sharing structure with the multi-scale convolutional layers. However, the reduction in parameters can also make the extracted image features less accurate. To solve this problem, we add an improved visual attention mechanism to extract the key information features in the image, so that the network can focus on the features of the required objects like the human brain and extract more critical feature information. In addition, self-attention directly computes the relationship between any two pixels in the image, obtaining the global geometric features of the image in one step and allowing the model to better learn the dependencies between global features. The input of the WMA feature extraction network is I = {i_1, i_2, . . . , i_m}, and the three outputs of the multi-scale convolutional network are F_{1×1} ∈ R^{n×d}, F_{3×3} ∈ R^{n×d}, and F_{5×5} ∈ R^{n×d}. These outputs are sent to the self-attention branch, and the results are finally accumulated to obtain the output of the WMA network, F̂ = {f̂_1, f̂_2, . . . , f̂_n} ∈ R^{n×d}, which can be calculated as follows:

F̂ = γ · Σ_{k=1}^{b} SA(F_k),    (3)

where SA(·) denotes the self-attention branch, b is the number of branches of the multi-scale convolutional network, and γ is a learnable scalar.
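The accumulation step described above can be illustrated with a minimal pure-Python sketch. This is an assumption-laden simplification, not the exact implementation: the learned query/key/value projections of a full self-attention layer are omitted, and the learnable scalar is reduced to a plain float shared by all branches.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(F):
    """Scaled dot-product self-attention over n feature vectors of length d.
    Every pixel attends to every other, giving global dependencies in one step.
    (Projections omitted for brevity - this is a sketch.)"""
    d = len(F[0])
    scores = [[sum(a * b for a, b in zip(fi, fj)) / math.sqrt(d) for fj in F]
              for fi in F]
    weights = [softmax(row) for row in scores]
    return [[sum(w * fj[k] for w, fj in zip(row, F)) for k in range(d)]
            for row in weights]

def wma_fuse(branch_outputs, gamma):
    """Accumulate the self-attended outputs of the multi-scale branches,
    scaled by the scalar gamma (learnable in the real network)."""
    n, d = len(branch_outputs[0]), len(branch_outputs[0][0])
    out = [[0.0] * d for _ in range(n)]
    for F in branch_outputs:
        A = self_attention(F)
        for i in range(n):
            for k in range(d):
                out[i][k] += gamma * A[i][k]
    return out
```

As a quick check: when all rows of a branch are identical, the attention weights are uniform and the branch returns its input unchanged, so three identical branches fused with gamma = 0.5 yield 1.5 times the input features.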

Experiment
We mainly use the PASCAL VOC 2012 [25] dataset to evaluate the standalone object detection capability of our model, and we evaluate its feature extraction performance within a VQA system using the VQA 1.0 and VQA 2.0 datasets [26]. The testing environment was a single Nvidia GTX 2080Ti graphics card with 16 GB of memory and an Intel(R) Core(TM) i7-7700 CPU at 3.60 GHz. The ablation experiments of the model are shown in Table 1.
As shown in Table 1 and Fig. 2, network models with different configurations train to different effects. Pre-train indicates the basic feature extraction network used by each model; Multi-branch indicates whether the model uses a multi-scale feature extraction network and, if so, which form; Weight-sharing indicates whether the weight-sharing mechanism is adopted; Attention indicates whether the model contains a self-attention mechanism. Baseline serves as the control network: it uses VGG16 as the basic feature extraction network, without a multi-scale structure or weight sharing, and its final mAP is 65.7%. Seven other models are set up as comparative experiments over the Pre-train, Multi-branch, Weight-sharing, and Attention settings. Comparing Model 3 with Model 6 shows that the model clearly learns more image features when ResNet152 is adopted. Comparing Model 2 with Model 3, or Model 5 with Model 6, we find that the models using the weight-sharing mechanism achieve better results, because this mechanism shares the weight parameters learned across the different branches of the multi-scale network, allowing the model to learn more image feature information within a limited parameter budget and improving detection. Comparing Model 1 with Model 2, or Model 4 with Model 5, we find that without the 5 × 5 branch the model is insensitive to feature information from objects of different sizes. On the whole, networks built on ResNet extract features better than those built on VGG. Comparing Model 6 with Model 7 shows that the self-attention mechanism also improves the detection performance of the model.
Because the attention mechanism makes the model pay more attention to the key regions in the image and reduces unnecessary waste of resources, the detection effect improves. The overall detection score on the public Pascal VOC 2012 dataset reaches 73.6% mAP, which confirms the robustness of our proposed image feature extraction method WMA. To further measure the performance of our model, we compare the VQA model equipped with WMA against the most widely used VQA models of recent years. As shown in Table 2, we performed answer sampling on the VQA 2.0 dataset: because each image-question pair in the VQA dataset is annotated with 10 standard answers, for each question we keep only the answers that appear more than three times, to enhance the validity of the dataset. In addition, we used the Visual Genome dataset as an extended dataset to train our model; it is three times the size of the model's training set, and adding it greatly enhances the richness of our data. Table 2 shows that the bilinear models MCB and MLB have strong advantages over other simple VQA models. When we replaced the feature extraction modules in MCB and MLB with the WMA network, the accuracy of the VQA model increased by 1.09% and 1.12%, respectively. This shows that the WMA network we designed can extract fine-grained features and focus on key regions in the image; it therefore lays a good foundation for the subsequent feature fusion step and improves the overall efficiency of the VQA model. Typical examples are shown in Fig. 3. In each example, the left is the original picture, the right is the heat map generated by the intermediate steps of the WMA model, and the bottom is the question-and-answer result. The VQA model using Faster R-CNN does not correctly identify the number of birds standing on the wall in Fig. 3.
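The answer handling described above follows the standard VQA consensus metric, under which a predicted answer is considered fully correct when at least three of the ten annotators gave it. A minimal sketch (the function name is ours):

```python
def vqa_accuracy(predicted, human_answers):
    """Standard VQA consensus accuracy: min(#matching annotators / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Three or more of the ten annotators agreeing makes the answer fully correct;
# fewer matches earn partial credit.
print(vqa_accuracy("2", ["2"] * 10))              # 1.0
print(vqa_accuracy("2", ["2", "2"] + ["3"] * 8))  # roughly 0.67
```

This soft scoring is why keeping only answers with sufficient annotator agreement tightens the training signal.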

Conclusion
In this paper, we design WMA, a multi-scale self-attention object detection model based on weight sharing for VQA, which can extract fine-grained features from objects of different sizes and focus on key regions in the image, thereby improving the overall accuracy of the VQA system. In addition, WMA can be applied to other object detection tasks as a general feature extraction network. Our experimental results on the Pascal VOC 2012, VQA 1.0, and VQA 2.0 datasets demonstrate the effectiveness and robustness of the WMA module.
Funding Statement: This work is supported by the National Natural Science Foundation of China (61872231, 61701297).

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.