TY  - EJOU
AU  - Shin, Ki-Young
AU  - Kwon, Soonmo
AU  - Park, Kyudong

TI  - Mitigating Visual Noise in Multimodal AI: Selective Visual Grounding for Multimodal Machine Translation
T2  - Computer Modeling in Engineering & Sciences

PY  - 
VL  - 
IS  - 
SN  - 1526-1506

AB  - Multimodal AI systems often suffer from “over-informing”, where excessive raw visual input introduces noise that distracts from task-relevant decisions. Motivated by selective human attention strategies, we propose ARS-MMT (Attention and Reasoning through Source Sentences for Multimodal Machine Translation), an architecture that operationalizes a “look-and-think” pipeline: a source-language encoder first builds contextualized linguistic representations, a relation reasoning network then produces a query-conditioned visual channel, and a multimodal decoder generates the translation conditioned in parallel on the encoded text and on this visual channel. We quantify the contribution of the visual modality through a controlled ablation: zeroing visual features reduces BLEU by 0.81 on test_2016_flickr En-De, while shuffling visual features across the batch changes BLEU by only +0.01, indicating that the channel responds primarily to the presence of visual context rather than to its image-specific content. We additionally add a contemporary 7B-parameter vision–language baseline (LLaVA-1.5) and show that our compact 4.3M-parameter specialized model is competitive in-domain. To address the open question of whether per-region visual attention constitutes a faithful explanation in the multimodal-translation setting, we conduct a deletion/insertion AUC analysis and report a null result consistent with prior findings on text attention. We therefore characterize ARS-MMT as an architecture whose modality-level visual contribution is measurable but whose per-region attention is not by itself a faithful explanation; faithful per-region attribution is identified as a target for complementary explanation methods. We discuss implications for efficient and inspectable multimodal systems in engineering deployment.
KW  - Explainable AI; multimodal machine translation; selective visual grounding; attention faithfulness; modality ablation; vision–language models

DO  - 10.32604/cmes.2026.083410