TY - EJOU
AU - Shin, Ki-Young
AU - Kwon, Soonmo
AU - Park, Kyudong
TI - Mitigating Visual Noise in Multimodal AI: Selective Visual Grounding for Multimodal Machine Translation
T2 - Computer Modeling in Engineering & Sciences
PY -
VL -
IS -
SN - 1526-1506
AB - Multimodal AI systems often suffer from “over-informing”, where excessive raw visual input introduces noise that distracts from task-relevant decisions. Motivated by selective human attention strategies, we propose ARS-MMT (Attention and Reasoning through Source Sentences for Multimodal Machine Translation), an architecture that operationalizes a “look-and-think” pipeline: a source-language encoder first builds contextualized linguistic representations, a relation reasoning network then produces a query-conditioned visual channel, and a multimodal decoder generates the translation conditioned in parallel on the encoded text and on this visual channel. We quantify the contribution of the visual modality through a controlled ablation: zeroing visual features reduces BLEU by 0.81 on test_2016_flickr En-De, while shuffling visual features across the batch changes BLEU by only +0.01, indicating that the channel responds primarily to the presence of visual context rather than to its image-specific content. We also include a contemporary 7B-parameter vision–language baseline (LLaVA-1.5) and show that our compact 4.3M-parameter specialized model is competitive in-domain. To address the open question of whether per-region visual attention constitutes a faithful explanation in the multimodal-translation setting, we conduct a deletion/insertion AUC analysis and report a null result consistent with prior findings on text attention. We therefore characterize ARS-MMT as an architecture whose modality-level visual contribution is measurable but whose per-region attention is not by itself a faithful explanation; faithful per-region attribution is identified as a target for complementary explanation methods. We discuss implications for efficient and inspectable multimodal systems in engineering deployment.
KW - Explainable AI; multimodal machine translation; selective visual grounding; attention faithfulness; modality ablation; vision–language models
DO - 10.32604/cmes.2026.083410