Open Access
ARTICLE
Multi-Scale Feature Fusion and Advanced Representation Learning for Multi Label Image Classification
1 Institute of Artificial Intelligence on Education Research, College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, 200234, China
2 Lab for Educational Big Data and Policymaking, Ministry of Education, Shanghai Normal University, Shanghai, 200234, China
3 Shanghai Intelligent Education Big Data Engineering Technology Research Center, Shanghai Normal University, Shanghai, 200234, China
4 Shanghai Online Education Research Base for Primary and Secondary Schools, Shanghai, 200234, China
5 DS Information Technology Co., Ltd., Shanghai, 200032, China
6 Faculty of Innovation Engineering, Macau University of Science and Technology, Macau, 999078, China
* Corresponding Author: Xiao Lin. Email:
(This article belongs to the Special Issue: The Latest Deep Learning Architectures for Artificial Intelligence Applications)
Computers, Materials & Continua 2025, 82(3), 5285-5306. https://doi.org/10.32604/cmc.2025.059102
Received 28 September 2024; Accepted 10 January 2025; Issue published 06 March 2025
Abstract
Multi-label image classification is challenging because objects in an image vary widely in size and often appear against complex backgrounds. Obtaining precise class-specific representations at different scales is therefore a key aspect of feature representation. However, existing methods often rely on a single-scale deep feature and neglect features from shallower and deeper layers, which makes it difficult to predict objects of varying scales within the same image. Although some studies have explored multi-scale features, they rarely address the flow of information between scales or obtain precise class-specific representations efficiently for features at different scales. To address these issues, we propose a two-stage, three-branch Transformer-based framework. The first stage combines multi-scale image feature extraction with hierarchical scale attention. This design lets the model consider objects at various scales while enhancing the flow of information across feature scales, improving the model’s generalization to diverse object scales. The second stage includes a global feature enhancement module and a region selection module. The global feature enhancement module strengthens the interconnections between different image regions, mitigating the problem of incomplete representations, while the region selection module models the cross-modal relationships between image features and labels. Together, these components enable the efficient acquisition of precise class-specific feature representations. Extensive experiments on the public datasets COCO2014, VOC2007, and VOC2012 demonstrate the effectiveness of the proposed method. Our approach achieves consistent performance gains of 0.3%, 0.4%, and 0.2% over state-of-the-art methods on the three datasets, respectively. These results validate the reliability and superiority of our approach for multi-label image classification.
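As a rough illustration of what "class-specific representations" over multi-scale features can mean, the sketch below lets one embedding per label attend over feature tokens pooled from several scales and then scores each label independently. It is not the authors' implementation: the function names (`class_specific_features`, `multilabel_scores`), the token counts, the random inputs, and the single dot-product attention step are all assumptions chosen to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_specific_features(scale_tokens, label_embed):
    """Hypothetical helper: labels attend over tokens from all scales.

    scale_tokens: list of (N_s, D) arrays, one per feature scale.
    label_embed:  (C, D) array, one embedding per class label.
    Returns (C, D) class-specific features and the (C, N) attention map.
    """
    tokens = np.concatenate(scale_tokens, axis=0)            # (N, D), all scales
    d = tokens.shape[1]
    attn = softmax(label_embed @ tokens.T / np.sqrt(d), -1)  # (C, N)
    return attn @ tokens, attn

def multilabel_scores(class_feats, classifier_w):
    # One independent sigmoid score per label (multi-label, not softmax).
    logits = (class_feats * classifier_w).sum(axis=1)
    return 1.0 / (1.0 + np.exp(-logits))

# Toy inputs (assumed sizes): three scales with 64, 16, and 4 tokens.
rng = np.random.default_rng(0)
D, C = 16, 5
scale_tokens = [rng.normal(size=(n, D)) for n in (64, 16, 4)]
label_embed = rng.normal(size=(C, D))
classifier_w = rng.normal(size=(C, D))

feats, attn = class_specific_features(scale_tokens, label_embed)
scores = multilabel_scores(feats, classifier_w)
```

Each row of `attn` shows which regions, across all scales, a given label draws on, which is the intuition behind pairing multi-scale features with label-aware region selection.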
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.