Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems

Xiliang Zhang; Jin Liu; Yue Li; Zhongdai Wu; Y. Wang

doi:10.32604/cmc.2022.027097

Open Access

ARTICLE

Xiliang Zhang¹, Jin Liu¹,*, Yue Li¹, Zhongdai Wu²,³, Y. Ken Wang⁴

1 College of Information Engineering, Shanghai Maritime University, Shanghai, China
2 Shanghai Ship and Shipping Research Institute, Shanghai, China
3 COSCO Shipping Technology Co., LTD, Shanghai, China
4 Division of Management and Education, University of Pittsburgh, Bradford, USA

* Corresponding Author: Jin Liu. Email: email

Computers, Materials & Continua 2022, 73(3), 6407-6424. https://doi.org/10.32604/cmc.2022.027097

Received 10 January 2022; Accepted 14 April 2022; Issue published 28 July 2022

Abstract

The performance of Video Question and Answer (VQA) systems relies on capturing key information from both visual images and natural language in context to generate answers relevant to the questions. However, traditional linear combinations of multimodal features capture only shallow feature interactions and fall far short of the deep feature fusion the task requires. Attention mechanisms have been used to perform deep fusion, but most of them can only assign weights within a single modality, leading to attention imbalance across modalities. To address these problems, we propose a novel VQA model based on Triple Multimodal feature Cyclic Fusion (TMCF) and a Self-Adaptive Multimodal Balancing Mechanism (SAMB). Our model is designed to enhance complex interactions among multimodal features while balancing cross-modal information. In addition, TMCF and SAMB can be used as extensible plug-ins for exploring new feature combinations in the visual image domain. Extensive experiments were conducted on the MSVD-QA and MSRVTT-QA datasets. The results confirm the advantages of our approach in handling multimodal tasks. We also provide ablation analyses to verify the effectiveness of each proposed component.

Keywords

Video question and answer systems; feature fusion; scaling matrix; attention mechanism

Cite This Article

APA Style

Zhang, X., Liu, J., Li, Y., Wu, Z., Ken Wang, Y. (2022). Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems. Computers, Materials & Continua, 73(3), 6407–6424. https://doi.org/10.32604/cmc.2022.027097

Vancouver Style

Zhang X, Liu J, Li Y, Wu Z, Ken Wang Y. Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems. Comput Mater Contin. 2022;73(3):6407–6424. https://doi.org/10.32604/cmc.2022.027097

IEEE Style

X. Zhang, J. Liu, Y. Li, Z. Wu, and Y. Ken Wang, “Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems,” Comput. Mater. Contin., vol. 73, no. 3, pp. 6407–6424, 2022. https://doi.org/10.32604/cmc.2022.027097

Copyright © 2022 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
