TY - EJOU
AU - Alamri, Faten S.
AU - Ayesha, Noor
AU - Zafar, Afia
AU - Saleem, Adil Ali
AU - Khan, Amjad R.
TI - Multimodal Graph-Enhanced Vision Transformer for Interpretable Skin Lesion Classification
T2 - Computer Modeling in Engineering & Sciences
PY - 2026
VL - 147
IS - 1
SN - 1526-1506
AB - Automated skin lesion classification remains challenging because benign and malignant lesions are often visually similar. Most deep learning methods rely on dermoscopic images alone and ignore the clinical metadata that dermatologists routinely use. This paper proposes a multimodal vision-graph framework that combines image encoding with a graph neural network representation of metadata through learnable attention fusion. The framework addresses three limitations: underutilization of clinical context, lack of interpretability, and suboptimal integration of modalities. Dual explainability is provided by Gradient-weighted Class Activation Mapping++ (Grad-CAM++) for visual attention and SHapley Additive exPlanations (SHAP) for feature importance. On the HAM10000 and Derm7pt datasets, the framework reaches 89.3% and 92.1% accuracy, statistically significant improvements (p < 0.001) of 4.1% and 2.7% over image-only baselines. Attention weight analysis shows that metadata receives an average of 37.7% (± 8.4%) of the fusion weight, which confirms the clinical importance of multimodal modeling. The ablation study shows that graph-based metadata encoding outperforms standard multilayer perceptron encoding by 1.4% (p = 0.003).
KW - Skin lesion classification; vision transformer; graph neural network; multimodal learning; explainable AI; medical image analysis
DO - 10.32604/cmes.2026.080335