TY  - EJOUR
AU  - Dewi, Christine
AU  - Chernovita, Hanna Prillysca
AU  - Philemon, Stephen Abednego
AU  - Ananta, Christian Adi
AU  - Chen, Abbott Po Shun
TI  - Adjusted Reasoning Module for Deep Visual Question Answering Using Vision Transformer
T2  - Computers, Materials & Continua
PY  - 2024
VL  - 81
IS  - 3
SN  - 1546-2226
AB  - Visual Question Answering (VQA) is an interdisciplinary artificial intelligence (AI) task that integrates computer vision and natural language processing. Its purpose is to enable machines to answer questions by utilizing visual information. A VQA system typically takes an image and a natural language question as input and produces a textual answer as output. One major obstacle in VQA is finding an effective method to extract and merge textual and visual data. We examine “Fusion” models that combine information from both the text encoder and the image encoder to perform the visual question-answering task efficiently. For the text encoder, we utilize the transformer models BERT and RoBERTa, which process the textual data. The image encoder, which processes the image data, utilizes ViT (Vision Transformer), DeiT (Data-efficient Image Transformer), and BEiT (Bidirectional Encoder representation from Image Transformers). We update the reasoning module of the VQA model and incorporate layer normalization to enhance performance. Compared with the results of previous research, our proposed method yields a substantial improvement. Our experiments obtained 60.4% accuracy on the PathVQA dataset and 69.2% accuracy on the VizWiz dataset.
KW  - VQA
KW  - vision transformer
KW  - multimodal data
KW  - deep learning
DO  - 10.32604/cmc.2024.057453
ER  - 