Open Access

ARTICLE

Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion

Huansha Wang*, Ruiyang Huang*, Qinrang Liu, Xinghao Wang

National Digital Switching System Engineering & Technological R&D Center, Information Engineering University, Zhengzhou, 450001, China

* Corresponding Authors: Huansha Wang; Ruiyang Huang

Computers, Materials & Continua 2025, 83(3), 5747-5760. https://doi.org/10.32604/cmc.2025.061902

Abstract

Multi-modal Named Entity Recognition (MNER) aims to identify meaningful textual entities more accurately by integrating information from images. Previous work has focused on extracting visual semantics at a fine-grained level or on obtaining entity-related external knowledge from knowledge bases or Large Language Models (LLMs). However, these approaches overlook the weak semantic correlation between the visual and textual modalities in MNER datasets and do not explore alternative multi-modal fusion strategies. In this paper, we present MMAVK, a multi-modal named entity recognition model with auxiliary visual knowledge and word-level fusion, which leverages a Multi-modal Large Language Model (MLLM) as an implicit knowledge base and extracts vision-based auxiliary knowledge from the image for more accurate and effective recognition. Specifically, we propose vision-based auxiliary knowledge generation, which uses target-specific prompts to guide the MLLM to extract external knowledge derived exclusively from the image, thereby avoiding the redundant recognition and cognitive confusion caused by processing image-text pairs simultaneously. Furthermore, we employ a word-level multi-modal fusion mechanism that fuses the extracted external knowledge with each word embedding produced by the transformer-based encoder. Extensive experimental results demonstrate that MMAVK matches or outperforms state-of-the-art methods on two classical MNER datasets, even though the large models it employs have significantly fewer parameters than those used by other baselines.
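The abstract outlines two components that lend themselves to a short illustration: a target-specific prompt that asks the MLLM to describe only what is visible in the image, and a word-level gate that fuses the resulting auxiliary-knowledge text with each token embedding. The sketch below is a minimal PyTorch rendering under assumed names and details (the prompt wording, the class `WordLevelFusion`, the gated-fusion form, and the 768-dimensional embeddings are all illustrative assumptions; the paper's exact prompt and fusion equations are not given in this abstract):

```python
import torch
import torch.nn as nn

# Hypothetical target-specific prompt: the MLLM sees ONLY the image, so the
# auxiliary knowledge it returns is purely vision-based and cannot be
# confused by the paired text. Wording is illustrative, not the paper's.
AUX_KNOWLEDGE_PROMPT = (
    "Describe the people, organizations, locations, and notable objects "
    "visible in this image, including any background knowledge about them."
)

class WordLevelFusion(nn.Module):
    """Illustrative word-level gated fusion of auxiliary visual knowledge
    into per-token text embeddings (structure assumed, not from the paper)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # per-word fusion gate
        self.proj = nn.Linear(dim, dim)      # projects knowledge features

    def forward(self, tokens: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # tokens:    (batch, seq_len, dim) word embeddings from the text encoder
        # knowledge: (batch, dim) pooled encoding of the MLLM-generated
        #            vision-based auxiliary knowledge text
        k = self.proj(knowledge).unsqueeze(1).expand_as(tokens)
        g = torch.sigmoid(self.gate(torch.cat([tokens, k], dim=-1)))
        # Each word decides individually how much visual knowledge to absorb.
        return g * tokens + (1 - g) * k

# Minimal usage with random tensors standing in for real encoder outputs.
fusion = WordLevelFusion(dim=768)
tokens = torch.randn(2, 16, 768)    # e.g., BERT-style token embeddings
knowledge = torch.randn(2, 768)     # e.g., pooled encoding of the MLLM output
fused = fusion(tokens, knowledge)   # (2, 16, 768), fed to a downstream tagger
```

A per-token gate of this kind captures the "word-level" aspect of the fusion the abstract describes: rather than concatenating one knowledge vector at the sentence level, each word embedding mixes in a different amount of the visual knowledge before sequence labeling.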

Keywords

Multi-modal named entity recognition; large language model; multi-modal fusion

Cite This Article

APA Style
Wang, H., Huang, R., Liu, Q., & Wang, X. (2025). Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion. Computers, Materials & Continua, 83(3), 5747–5760. https://doi.org/10.32604/cmc.2025.061902
Vancouver Style
Wang H, Huang R, Liu Q, Wang X. Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion. Comput Mater Contin. 2025;83(3):5747–5760. https://doi.org/10.32604/cmc.2025.061902
IEEE Style
H. Wang, R. Huang, Q. Liu, and X. Wang, “Multi-Modal Named Entity Recognition with Auxiliary Visual Knowledge and Word-Level Fusion,” Comput. Mater. Contin., vol. 83, no. 3, pp. 5747–5760, 2025. https://doi.org/10.32604/cmc.2025.061902



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.