Open Access

ARTICLE

Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion

Huansha Wang*, Ruiyang Huang*, Qinrang Liu, Xinghao Wang

National Digital Switching System Engineering & Technological R&D Center, Information Engineering University, Zhengzhou, 450001, China

* Corresponding Authors: Huansha Wang; Ruiyang Huang

Computers, Materials & Continua 2025, 83(3), 5747-5760. https://doi.org/10.32604/cmc.2025.061902

Abstract

Multi-modal Named Entity Recognition (MNER) aims to identify meaningful textual entities more accurately by integrating information from images. Previous work has focused on extracting visual semantics at a fine-grained level or on obtaining entity-related external knowledge from knowledge bases or Large Language Models (LLMs). However, these approaches overlook the weak semantic correlation between the visual and textual modalities in MNER datasets and do not explore alternative multi-modal fusion strategies. In this paper, we present MMAVK, a multi-modal named entity recognition model with auxiliary visual knowledge and word-level fusion, which leverages a Multi-modal Large Language Model (MLLM) as an implicit knowledge base and extracts vision-based auxiliary knowledge from the image for more accurate and effective recognition. Specifically, we propose vision-based auxiliary knowledge generation, which uses target-specific prompts to guide the MLLM to extract external knowledge derived exclusively from the image, thereby avoiding the redundant recognition and cognitive confusion caused by processing image-text pairs simultaneously. Furthermore, we employ a word-level multi-modal fusion mechanism that fuses the extracted external knowledge with each word embedding produced by the transformer-based encoder. Extensive experimental results demonstrate that MMAVK matches or outperforms state-of-the-art methods on two classical MNER datasets, even though the large models it employs have significantly fewer parameters than those used by other baselines.
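The abstract outlines two components that lend themselves to a short illustration: a target-specific prompt that asks the MLLM to describe only what is visible in the image, and a word-level gate that fuses the resulting auxiliary-knowledge text with each token embedding. The sketch below is a minimal PyTorch rendering under assumed names and details (the prompt wording, the class `WordLevelFusion`, the gated-fusion form, and the 768-dimensional embeddings are all illustrative assumptions; the paper's exact prompt and fusion equations are not given in this abstract):

```python
import torch
import torch.nn as nn

# Hypothetical target-specific prompt: the MLLM sees ONLY the image, so the
# auxiliary knowledge it returns is purely vision-based and cannot be
# confused by the paired text. Wording is illustrative, not the paper's.
AUX_KNOWLEDGE_PROMPT = (
    "Describe the people, organizations, locations, and notable objects "
    "visible in this image, including any background knowledge about them."
)

class WordLevelFusion(nn.Module):
    """Illustrative word-level gated fusion of auxiliary visual knowledge
    into per-token text embeddings (structure assumed, not from the paper)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # per-word fusion gate
        self.proj = nn.Linear(dim, dim)      # projects knowledge features

    def forward(self, tokens: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # tokens:    (batch, seq_len, dim) word embeddings from the text encoder
        # knowledge: (batch, dim) pooled encoding of the MLLM-generated
        #            vision-based auxiliary knowledge text
        k = self.proj(knowledge).unsqueeze(1).expand_as(tokens)
        g = torch.sigmoid(self.gate(torch.cat([tokens, k], dim=-1)))
        # Each word decides individually how much visual knowledge to absorb.
        return g * tokens + (1 - g) * k

# Minimal usage with random tensors standing in for real encoder outputs.
fusion = WordLevelFusion(dim=768)
tokens = torch.randn(2, 16, 768)    # e.g., BERT-style token embeddings
knowledge = torch.randn(2, 768)     # e.g., pooled encoding of the MLLM output
fused = fusion(tokens, knowledge)   # (2, 16, 768), fed to a downstream tagger
```

A per-token gate of this kind captures the "word-level" aspect of the fusion the abstract describes: rather than concatenating one knowledge vector at the sentence level, each word embedding mixes in a different amount of the visual knowledge before sequence labeling.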

Keywords

Multi-modal named entity recognition; large language model; multi-modal fusion

Cite This Article

APA Style
Wang, H., Huang, R., Liu, Q., & Wang, X. (2025). Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion. Computers, Materials & Continua, 83(3), 5747–5760. https://doi.org/10.32604/cmc.2025.061902
Vancouver Style
Wang H, Huang R, Liu Q, Wang X. Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion. Comput Mater Contin. 2025;83(3):5747–5760. https://doi.org/10.32604/cmc.2025.061902
IEEE Style
H. Wang, R. Huang, Q. Liu, and X. Wang, “Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion,” Comput. Mater. Contin., vol. 83, no. 3, pp. 5747–5760, 2025. https://doi.org/10.32604/cmc.2025.061902



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.