TY - EJOU
AU - Alamri, Faten S.
AU - Ayesha, Noor
AU - Zafar, Afia
AU - Saleem, Adil Ali
AU - Khan, Amjad R.
TI - Multimodal Graph-Enhanced Vision Transformer for Interpretable Skin Lesion Classification
T2 - Computer Modeling in Engineering & Sciences
PY - 2026
VL - 147
IS - 1
SN - 1526-1506
AB - Automated skin lesion classification remains challenging because benign and malignant lesions are visually very similar. Most deep learning methods use dermoscopic images only and ignore the clinical metadata that dermatologists routinely rely on. This paper proposes a multimodal vision-graph framework that couples image encoding with a graph neural network-based metadata representation through learnable attention fusion. The framework addresses three limitations: underutilization of clinical context, lack of interpretability, and suboptimal integration of modalities. Dual explainability is provided by Gradient-weighted Class Activation Mapping++ (Grad-CAM++) for visual attention and SHapley Additive exPlanations (SHAP) for feature importance. On the HAM10000 and Derm7pt datasets, the framework achieves 89.3% and 92.1% accuracy, respectively, which are statistically significant improvements (p < 0.001) of 4.1% and 2.7% over image-only baselines. Attention weight analysis shows that metadata contributes 37.7% ± 8.4% on average, which supports the clinical importance of multimodal modeling. An ablation study shows that graph-based metadata encoding outperforms standard multilayer perceptron encoding by 1.4% (p = 0.003).
KW - Skin lesion classification; vision transformer; graph neural network; multimodal learning; explainable AI; medical image analysis
DO - 10.32604/cmes.2026.080335