Mitigating Visual Noise in Multimodal AI: Selective Visual Grounding for Multimodal Machine Translation

Ki-Young Shin¹, Soonmo Kwon², Kyudong Park³,*
1 Designovel Lab, Designovel, Pohang, Republic of Korea
2 Department of Convergence IT Engineering, POSTECH, Pohang, Republic of Korea
3 School of Information Convergence, Kwangwoon University, Seoul, Republic of Korea
* Corresponding Author: Kyudong Park. Email: email

Computer Modeling in Engineering & Sciences https://doi.org/10.32604/cmes.2026.083410

Received 03 April 2026; Accepted 11 June 2026; Published online 29 June 2026


Abstract

Multimodal AI systems often suffer from “over-informing”, where excessive raw visual input introduces noise that distracts from task-relevant decisions. Motivated by selective human attention strategies, we propose ARS-MMT (Attention and Reasoning through Source Sentences for Multimodal Machine Translation), an architecture that operationalizes a “look-and-think” pipeline: a source-language encoder first builds contextualized linguistic representations, a relation reasoning network then produces a query-conditioned visual channel, and a multimodal decoder generates the translation conditioned in parallel on the encoded text and on this visual channel. We quantify the contribution of the visual modality through a controlled ablation: zeroing the visual features reduces BLEU by 0.81 on test_2016_flickr En-De, whereas shuffling the visual features across the batch changes BLEU by only +0.01, indicating that the channel responds primarily to the presence of visual context rather than to its image-specific content. We also include a contemporary 7B-parameter vision–language baseline (LLaVA-1.5) and show that our compact 4.3M-parameter specialized model is competitive in-domain. To test whether per-region visual attention constitutes a faithful explanation in the multimodal-translation setting, we conduct a deletion/insertion AUC analysis and report a null result, consistent with prior findings on text attention. We therefore characterize ARS-MMT as an architecture whose modality-level visual contribution is measurable but whose per-region attention is not by itself a faithful explanation; we identify faithful per-region attribution as a target for complementary explanation methods. We discuss implications for efficient and inspectable multimodal systems in engineering deployment.
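The two ablation conditions and the area-under-curve summary mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (`zero_visual`, `shuffle_visual`, `trapezoid_auc`), the list-of-vectors feature layout, and the choice of a derangement for shuffling (so that no sentence keeps its own image) are all assumptions made for illustration.

```python
import random
from typing import List, Sequence

# Assumed layout: one visual feature vector per example in the batch.
Features = List[List[float]]


def zero_visual(batch: Sequence[Sequence[float]]) -> Features:
    """Zeroing condition: replace every visual vector with zeros."""
    return [[0.0] * len(vec) for vec in batch]


def shuffle_visual(batch: Sequence[Sequence[float]], seed: int = 0) -> Features:
    """Shuffling condition: pair each sentence with another example's image.

    Assumption: we draw a derangement, so no example keeps its own features.
    A plain random permutation would also be a reasonable reading.
    """
    n = len(batch)
    if n < 2:
        return [list(vec) for vec in batch]  # nothing to swap with
    rng = random.Random(seed)
    perm = list(range(n))
    # Start from the identity and reshuffle until there is no fixed point.
    while any(i == p for i, p in enumerate(perm)):
        rng.shuffle(perm)
    return [list(batch[p]) for p in perm]


def trapezoid_auc(scores: Sequence[float]) -> float:
    """Area under a deletion or insertion curve by the trapezoid rule.

    `scores` holds the model score after each perturbation step, with the
    steps assumed evenly spaced and the x-axis normalized to [0, 1].
    """
    if len(scores) < 2:
        raise ValueError("need at least two points to form a curve")
    steps = len(scores) - 1
    area = sum((scores[i] + scores[i + 1]) / 2.0 for i in range(steps))
    return area / steps
```

In use, one would decode the test set three times (original, zeroed, and shuffled visual features) and compare BLEU across the runs; a large drop under zeroing with a negligible change under shuffling is the pattern the abstract reports. For the faithfulness check, an attention ranking is considered informative only if its deletion or insertion AUC differs from that of a random ranking.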