TY - EJOUR
AU - Yan, Feng
AU - Silamu, Wushouer
AU - Li, Yanbing
TI - MVCE-Net: Multi-View Region Feature and Caption Enhancement Co-Attention Network for Visual Question Answering
T2 - Computers, Materials & Continua
PY - 2023
VL - 76
IS - 1
SN - 1546-2226
AB - Visual question answering (VQA) requires a deep understanding of images and their corresponding textual questions in order to answer questions about images accurately. However, existing models tend to ignore the implicit knowledge in images and focus only on their visual information, which limits the depth at which image content is understood. Images contain more than just visual objects: some contain textual information about the scene, and more complex images contain relationships between individual visual objects. First, this paper proposes a model that uses image descriptions for feature enhancement. The model encodes images and their descriptions separately based on a question-guided co-attention mechanism. This mechanism enriches the model's feature representation, enhancing its reasoning ability. In addition, this paper improves the bottom-up attention model by extracting two kinds of image region features. After obtaining the two visual features and the spatial position information corresponding to each, the model concatenates the two features into a final image feature that better represents the image. Finally, the obtained spatial position information is processed to enable the model to perceive the size and relative position of each object in the image. Our best single model achieves 74.16% overall accuracy on the VQA 2.0 dataset and even outperforms some multi-modal pre-training models while using fewer images and less training time.
KW - Bottom-up attention
KW - spatial position relationship
KW - region feature
KW - self-attention
DO - 10.32604/cmc.2023.038177
ER - 