Yaru Song*, Huahu Xu, Dikai Fang
Intelligent Automation & Soft Computing, Vol.39, No.3, pp. 397-416, 2024, DOI:10.32604/iasc.2023.040521
Abstract: Visual Question Answering (VQA) has sparked widespread interest as a crucial task in integrating vision and language. VQA models primarily use attention mechanisms to associate relevant visual regions with the input question and thereby answer it effectively. Detection-based features, extracted by an object detection network, capture the visual attention distribution over predetermined detection boxes and provide object-level insights that help answer questions about foreground objects. However, because they lack fine-grained detail, they cannot answer questions about background regions that have no detection boxes; capturing such detail is the advantage of grid-based features. In…