Open Access

ARTICLE

An Explainability-Aware Transformer Framework for Brain Tumor Segmentation and Classification Using MRI

Mamoona Jabbar, Uzma Jamil*, Muhammad Younas, Bushra Zafar

Department of Computer Science, Government College University, Faisalabad, Pakistan

* Corresponding Author: Uzma Jamil. Email: email

(This article belongs to the Special Issue: Artificial Intelligence Models in Healthcare: Challenges, Methods, and Applications)

Computer Modeling in Engineering & Sciences 2026, 147(1), 40 https://doi.org/10.32604/cmes.2026.080241

Abstract

Magnetic Resonance Imaging (MRI) is one of the most commonly used neuro-oncology imaging modalities; it is non-invasive and detects brain abnormalities effectively. Earlier research has demonstrated that brain tumor segmentation and classification can be performed effectively with deep learning techniques. However, existing studies aim primarily at increasing prediction accuracy and give little consideration to model interpretability, limiting their application in clinical practice. To address this limitation, this research presents a two-stage explainable deep learning model that combines transformer-based segmentation with an ensemble classification model selected for consistency of explanations. The first stage introduces Swin-DS-HAFUNetv2, an enhanced transformer-based segmentation architecture that integrates hierarchical Swin Transformer encoders, refined hierarchical attention fusion, a contextual bottleneck transformer, and multi-scale deep supervision to improve tumor localization in T1-weighted MRI, particularly under low-contrast and irregular morphological conditions. The second stage includes the ECWMEv2 ensemble classifier, which integrates a perturbation analysis based on Grad-CAM (Gradient-weighted Class Activation Mapping) to systematically assess the consistency and clinical significance of the visual explanations of candidate models. Only architectures that exhibit stable and pathology-consistent explanations, such as ConvNeXt, Swin Transformer, and EVA02, are retained and merged by means of explanation-weighted soft voting with XGBoost-based meta-learning. Experimental evaluation on the BRISC2025 benchmark dataset shows that Swin-DS-HAFUNetv2 achieves a mean Dice coefficient of 0.9782 and an Intersection over Union (IoU) of 0.8656, while ECWMEv2 achieves a classification accuracy of 0.9917 and a Macro-F1 score of 0.9867.
A mean Grad-CAM IoU of 0.692 reflects consistent, anatomically meaningful attention to tumor regions. These results demonstrate that integrating explanation stability as a fundamental design principle significantly improves model robustness and interpretability, providing a methodologically validated, benchmark-level framework for future studies of multimodal and clinically oriented brain tumor analysis systems.

Keywords

Brain tumor segmentation; classification; explainable artificial intelligence; Grad-CAM stability; medical image analysis; transformer networks

1  Introduction

Brain tumors are among the most aggressive and clinically complicated types of cancer and demand high diagnostic accuracy so that effective segmentation and classification can aid treatment planning and prognosis evaluation [1]. Magnetic Resonance Imaging (MRI) is the principal imaging modality in neuro-oncology because it offers superior soft-tissue contrast and can represent anatomical and pathological structures in the brain without exposure to ionizing radiation [2]. Despite these benefits, manual interpretation of MRI scans is time-consuming, labor-intensive, and prone to inter-observer variance, especially around tumor margins or non-uniform tissue appearance [3]. These difficulties have increased the use of deep learning (DL) methods for automated brain tumor detection, segmentation, and classification [4].

Convolutional Neural Networks (CNNs) have achieved high performance in medical image recognition because they learn hierarchical visual feature representations [5]. CNN-based models have also been applied to tumor localization and classification tasks, improving reproducibility and reducing the need for manual interpretation [6]. Nonetheless, standard CNN designs frequently fail to capture long-range spatial interactions and therefore perform poorly on large tumors or low-contrast MRI volumes [7].

To mitigate these limitations, transformer-based architectures have recently emerged as promising alternatives. Transformers use self-attention to extract global contextual associations in images and enhance feature representation in challenging medical imaging tasks [8,9]. However, several current transformer-based segmentation systems are two-dimensional and lack effective multi-scale feature fusion and deep supervision, limiting their generalization to varied tumor shapes and structures [10].

In classification, modern architectures, including CoAtNet, ConvNeXt, EVA02, Swin Transformer, and MaxViT, have shown stronger representational ability than standard CNN designs [11]. These models offer better specificity in identifying brain tumor subtypes and prognosis [12]. Nonetheless, segmentation and classification are still commonly treated as separate processes, with very few studies combining them into unified frameworks for automated brain tumor analysis [13].

A second major problem in medical artificial intelligence is the lack of interpretability. Most deep learning systems are black boxes, so a clinician can hardly comprehend how predictions were made [14]. To address this, explainable artificial intelligence (XAI) approaches such as Gradient-weighted Class Activation Mapping (Grad-CAM) and saliency mapping have been proposed to visualize the image regions that influence model predictions [15]. Nevertheless, in the majority of studies such explanations are applied only as post hoc visualization tools rather than being included in model development or evaluation [16].

Recent studies identify an important but understudied determinant of XAI reliability in medical imaging models: explanation stability, i.e., the consistency of saliency maps under small input perturbations [17–19]. Although attention-based architectures such as the Swin Transformer have demonstrated increased robustness in MRI analysis [20], many hybrid models that combine deep learning with explanation consistency still lack formal quantitative mechanisms for evaluating that reliability [21,22].

To address these problems, this paper proposes an explainability-aware framework that unites transformer-based tumor segmentation and explanation-consistent ensemble classification in a unified pipeline. The framework presents Swin-DS-HAFUNetv2, a transformer-based segmentation architecture with hierarchical attention fusion and deep supervision that enhances tumor boundary localization. Rather than applying explainability only as post hoc visualization, the Explanation-Consistent Weighted Meta-Ensemble (ECWMEv2) classification framework treats Grad-CAM explanation stability as a quantitative metric for model selection and ensemble construction. By adopting explanation consistency as a quantitative evaluation measure, the proposed approach aims to enhance model robustness, interpretability, and clinical reliability. The research objectives of this study are as follows.

OBJ 1.   To improve segmentation accuracy for low-contrast brain MRI using hierarchical attention and deep supervision within transformer-based architectures.

OBJ 2.   To improve the robustness and reliability of classification through an ensemble strategy guided by explanation consistency under input perturbations.

OBJ 3.   To design an interpretable model selection and fusion process that incorporates quantifiable Grad-CAM stability metrics, including IoU, cosine similarity, and pass rate as selection criteria.

Overall, the key contributions of this work may be summarized in the following way:

•   Enhanced Transformer-Based Segmentation Architecture: The work presents Swin-DS-HAFUNetv2, an improved transformer-based segmentation framework that incorporates Swin Transformer encoders, hierarchical attention fusion, and deep supervision to enhance multi-scale contextual feature coverage and achieve more precise tumor boundary segmentation in brain MRI images.

•   Explanation-Consistent Ensemble Classification Framework: The Explanation-Consistent Weighted Meta-Ensemble (ECWMEv2) uses Grad-CAM explanation stability as a model selection and weighting criterion. Candidate classification models are tested under input perturbation, and only those that generate stable and clinically meaningful attention maps are retained in the ensemble. The retained models are combined through explanation-weighted soft voting and an XGBoost-based meta-learning layer to improve the reliability and interpretability of the classification.

•   Perturbation-Based Quantitative Evaluation of Explanation Stability: This research proposes a perturbation-based Grad-CAM consistency analysis that provides a formal assessment of explanation stability. Quantitative similarity metrics, including Intersection over Union (IoU), cosine similarity, and pass rate, gauge the stability of the explanations, so that explanation consistency serves as a formal evaluation and filtering mechanism rather than a purely visual interpretation tool.
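The three stability metrics above can be computed directly from pairs of Grad-CAM heatmaps (original vs. perturbed input). The following Python sketch is an illustration, not the authors' exact implementation; the 0.5 binarization threshold and the pass-rate cutoffs are assumed values:

```python
import numpy as np

def cam_iou(cam_a, cam_b, thresh=0.5):
    """IoU between thresholded saliency maps from two input perturbations."""
    a, b = cam_a >= thresh, cam_b >= thresh
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both maps empty: treat as perfectly consistent
    return float(np.logical_and(a, b).sum() / union)

def cam_cosine(cam_a, cam_b, eps=1e-8):
    """Cosine similarity between flattened saliency maps."""
    a, b = cam_a.ravel(), cam_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def pass_rate(cam_pairs, iou_thresh=0.5, cos_thresh=0.8):
    """Fraction of perturbation pairs whose explanations stay consistent
    under both criteria (assumed thresholds)."""
    passed = [cam_iou(a, b) >= iou_thresh and cam_cosine(a, b) >= cos_thresh
              for a, b in cam_pairs]
    return sum(passed) / len(passed)
```

A model would be retained in the ensemble only if its pass rate over the perturbation set exceeds a chosen cutoff.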

The rest of this paper is structured as follows. Section 2 reviews the literature on deep learning-based brain tumor segmentation, classification, and explainable artificial intelligence in medical imaging. Section 3 presents the proposed methodology, including the Swin-DS-HAFUNetv2 segmentation model and the explanation-consistent ECWMEv2 ensemble classification strategy. Section 4 outlines the experimental design and presents the results and discussion. Section 5 concludes the paper and discusses limitations and future work.

2  Related Work

An extensive body of literature addresses deep learning methodology for brain tumor segmentation and classification. CNN methods remain popular for extracting discriminative tumor characteristics from MRI images, and several studies have shown deep convolutional architectures to be effective in brain tumor classification and localization tasks [23–25]. Moreover, some works include auxiliary clinical data, such as patient age or demographic characteristics, to add diagnostic context and improve classification performance [26].

Various researchers have tried to integrate segmentation and classification in hybrid learning models. For example, joint tumor segmentation and classification have been carried out with clustering-based and probabilistic neural network algorithms [27]. In other studies, Grad-CAM or related XAI methods are used to visualize regions of interest in tumor classification [28]. Nonetheless, such explanations are normally qualitative and are seldom substantiated with quantitative similarity measures such as Intersection over Union or cosine similarity. Optimization-based deep learning models have also been suggested to enhance classification performance, but they typically provide no mechanism for evaluating explanation robustness [29,30].

Attribution-based XAI approaches such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and Grad-CAM offer feature- or pixel-level interpretations, but they are poorly suited to dense spatial medical images where explanation consistency must hold at the region level [31]. Despite the extensive use of data augmentation to enhance model generalization, its effect on explanation fidelity has been little studied [32]. Classical machine learning methods, including clustering and SVM-based classifiers, have also been explored, but these models do not provide the interpretability and robustness needed in clinical practice [33,34].

More recent works have investigated hybrid convolutional-transformer designs to enhance feature representation and classification accuracy [35–38]. However, explainability within these models is often confined to visualization, and the stability of explanations under input perturbations is rarely considered. Neuro-oncology surveys have likewise shown that XAI methods are used predominantly for visual interpretation rather than in training or model validation [39–42].

Other medical imaging areas, such as stroke detection and pulmonary disease analysis, report similar limitations, with model interpretability and explanation robustness not yet adequately studied [43–47]. Despite the high classification accuracy of advanced architectures such as EfficientNet and dilated convolutional networks, current evaluation frameworks mostly focus on predictive power without assessing explanation stability or robustness to input perturbations [48–52].

Numerous segmentation architectures have been proposed to improve tumor boundary localization. Attention-based models, including ABANet, aim to improve boundary awareness but may lack sufficient contextual fusion mechanisms [53]. Traditional biomedical segmentation networks, e.g., U-Net and U-Net++, remain common, but they cannot capture long-range dependencies effectively without transformer-based encoders or deep supervision [54]. Some interpretable segmentation models provide graphical interpretability but do not use explanation measures in model selection or ensemble construction [55].

Moreover, existing studies on transfer learning and cross-dataset generalization primarily aim at improving classification accuracy while ignoring explanation fidelity [56]. Ensemble-based methods tend to increase predictive accuracy without assessing the stability of visual explanations under real-world imaging distortions [57,58]. Comparisons of CNN and transformer models likewise prioritize performance measures over explanation stability [59]. Recent explainable diagnosis systems also do not formally test explanation reliability under perturbation [60–62]. Although architectures such as U-Net++ and modified SegNet achieve better segmentation results, they do not yet incorporate quantitative explanation stability into their decision pipelines [63,64].

Despite the large number of deep learning models proposed for brain tumor segmentation and classification, current research still has several limitations. Many methods concentrate only on segmentation or on classification, and few have attempted to combine both into a cohesive framework. In addition, the stability and robustness of visual explanations under input perturbations have not been systematically evaluated or used as a quantitative criterion for model selection. To highlight these limitations, Table 1 summarizes the key research gaps in representative existing studies.


3  Proposed Methodology

The proposed research presents a two-stage explainable deep learning system for automated brain tumor detection in MRI images. The framework addresses two key challenges: (1) accurate delineation of tumor boundaries in complex anatomical structures and (2) classification of tumor types with predictions that remain consistent under input perturbations. Fig. 1 presents the overall structure, emphasizing the sequential processing of raw MRI input through segmentation, classification, and explainability modules to generate the final diagnostic output. For medical images, explainability methods must be spatially continuous and anatomically consistent to enable clinically meaningful interpretation. Brain MRI classification requires explanations that localize discriminative regions within tumor boundaries, rather than assigning importance to abstract feature dimensions. Accordingly, this study adopts saliency-based visualization through Grad-CAM, which generates spatially coherent heatmaps aligned with learned convolutional and transformer feature hierarchies. SHAP and LIME are attribution-based methods that offer attribute-wise interpretability, especially for tabular or low-dimensional data; however, they are less suited to complex medical imaging, where precise localization of regions and preservation of anatomical structure are essential. Their exclusion from this study therefore reflects not a shortage of efficacy but a methodological choice guided by the need for spatially meaningful interpretability. Grad-CAM is thus used not merely as a post hoc visualization tool but as an integral part of the framework, contributing to model selection, ensemble weighting, and perturbation-based assessment of explanation consistency.


Figure 1: Overview of the proposed two-stage explainability-aware framework for brain tumor segmentation, classification, and explanation consistency analysis.

The subsections that follow explain each component in detail, with the preprocessing stage being the first step to all the downstream learning activities.

3.1 Dataset

The BRISC2025 dataset is used in this study as a benchmark for brain tumor segmentation and classification from MRI images [20]. The data are freely available in official repositories such as Kaggle [13], Figshare [16], and Zenodo, which ensures transparency and allows researchers to reproduce the results. BRISC2025 was assembled by combining brain MRI images from various publicly available datasets into a single, consistent reference dataset. After integration, the data underwent a refinement process involving quality control, annotation verification, and expert review. A certified radiologist and physician reviewed all tumor annotations to ensure they were clinically plausible and consistent.

The dataset comprises 6000 single-slice, T1-weighted brain MRI scans. The current release does not include other MRI modalities such as T2, FLAIR, or contrast-enhanced sequences. The MRI images are stored as JPEG files, while the pixel-level tumor segmentation masks are stored as PNG files. The dataset covers MRI slices in three anatomical orientations, i.e., axial, coronal, and sagittal, so that features invariant to imaging orientation can be learned. The dataset supports two complementary learning tasks. The first is pixel-level tumor segmentation, in which binary masks define the entire tumor area. The second is image-level multi-class tumor classification with four diagnostic groups: Glioma, Meningioma, Pituitary Tumor, and No Tumor. Unlike volumetric datasets that consider only glioma segmentation, BRISC2025 allows tumor localization and multi-class tumor diagnosis to be evaluated in a single framework.

To ensure a sound experimental evaluation, the dataset was separated into training and testing sets to prevent data leakage. Specifically, the training set contains 5000 images and the testing set 1000 images, corresponding to a split ratio of approximately 83% and 17%, respectively [20]. Table 2 gives a class-wise breakdown of the images in the training and testing sets.


Table 3 shows sample segmentation cases across tumor types and imaging planes, provided as original MRI slices, binary masks, and tumor overlays.


Although volumetric multimodal MRI data are standard in clinical neuro-oncology workflows, the adoption of 2D T1-weighted images [25] in BRISC2025 is a dataset-level limitation rather than a limitation of the proposed framework. The dataset is explicitly aimed at benchmarking algorithms, comparing models, and developing methodologies for brain tumor segmentation and classification, not at clinical diagnostic use. Notably, the proposed framework and explainability-aware ensemble approach are modality-neutral and can be generalized to volumetric and multimodal MRI settings once relevant annotated data become available. The findings reported in this study should therefore be interpreted as internal methodological validation within the context determined by the dataset.

3.2 Data Preprocessing

The preprocessing stage is critical to the quality and reliability of segmentation and classification, as it directly determines the performance of all subsequent learning stages [4]. For the 6000 T1-weighted MRI slices of the BRISC2025 dataset, a four-step preprocessing pipeline was used in this study [20]. It involved skull stripping, intensity standardization, image scaling, and spatial standardization, which together address typical imaging artifacts in multi-institutional MRI datasets [13]. Intensity thresholding, morphological filtering, and connected-component analysis were combined for skull stripping to eliminate non-brain tissue while preserving peripheral tumor regions [36,65]. This was followed by intensity normalization (to zero mean and unit variance) and min-max scaling to [0, 1] to reduce scanner-induced variations and improve model convergence [16]. All MRI slices were then rescaled to 224 × 224 pixels using bilinear interpolation to maintain a uniform input size and important anatomical features without excessive computation. Several augmentation methods were applied during model training to enhance generalization and minimize overfitting; the augmentation pipeline included random rotations, Gaussian blur, and noise injection, which mimic changes common in clinical MRI images. This preprocessing pipeline preserves anatomical consistency and tumor visibility while providing standardized inputs to the subsequent segmentation and classification steps. The preprocessing steps and their parameters are summarized in Table 4.

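As a rough illustration of the pipeline above (skull stripping via thresholding, morphological opening, and connected-component analysis; z-score plus min-max normalization; bilinear resizing to 224 × 224), the following NumPy/SciPy sketch shows one plausible implementation. The intensity threshold and structuring element are assumed values, not parameters reported in the paper:

```python
import numpy as np
from scipy import ndimage

def skull_strip(img, thresh=0.1):
    """Keep the largest connected foreground component (the brain) after
    intensity thresholding and morphological opening (assumed 3x3 kernel)."""
    mask = ndimage.binary_opening(img > thresh, structure=np.ones((3, 3)))
    labels, n = ndimage.label(mask)
    if n == 0:
        return img
    largest = np.argmax(ndimage.sum(mask, labels, range(1, n + 1))) + 1
    return img * (labels == largest)

def normalize(img, eps=1e-8):
    """Z-score standardization followed by min-max scaling to [0, 1]."""
    z = (img - img.mean()) / (img.std() + eps)
    return (z - z.min()) / (z.max() - z.min() + eps)

def resize_bilinear(img, size=(224, 224)):
    """Bilinear (order-1 spline) resampling to the network input size."""
    zoom = (size[0] / img.shape[0], size[1] / img.shape[1])
    return ndimage.zoom(img, zoom, order=1)

def preprocess(img):
    """Full sketch of the preprocessing chain (augmentation omitted)."""
    return resize_bilinear(normalize(skull_strip(img)))
```

Augmentation (random rotation, Gaussian blur, noise injection) would be applied on top of this chain during training only.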

3.3 Brain Tumor Segmentation Using Swin-DS-HAFUNetv2

After preprocessing, brain tumor segmentation is the next major phase of the proposed two-stage system, since it must localize the tumor accurately enough to support robust classification and sound visual explanations. Inaccurate segmentation may result in activation of irrelevant background, misclassification, and untrustworthy explanatory visualizations.

To overcome these issues, Swin-DS-HAFUNetv2, an improved transformer-based segmentation architecture, is proposed in this study, designed to capture tumor boundaries and long-range contextual interactions in T1-weighted brain MRI scans. Swin-DS-HAFUNetv2 builds on the previously presented Swin-DS-HAFUNet [20] and incorporates improved hierarchical attention fusion, a contextual transformer bottleneck design, and multi-scale deep supervision.

The overall architecture (Fig. 2) follows an encoder-decoder design with attention-directed skip connections. Swin Transformer blocks in the encoder extract hierarchical feature representations, and the decoder then gradually reassembles high-resolution segmentation results through adaptive context aggregation and multi-scale feature integration. The proposed architecture comprises four primary elements.


Figure 2: Block diagram of the proposed Swin-DS-HAFUNetv2 model for brain tumor segmentation.

•   Hierarchical Swin Transformer encoding

•   Enhanced Hierarchical Attention Fusion (E-HAFv2)

•   Contextual Bottleneck Transformer (CBT)

•   Deep supervision-based decoder refinement

All of these elements enhance segmentation performance, especially for low-contrast tumors and abnormal tumor morphologies.

3.3.1 Tokenization and Swin Transformer Encoder Design

The encoder consists of a hierarchical Swin Transformer backbone that extracts multi-scale contextual representations of the input MRI slices. Each input MRI image is resized to 224 × 224 × 3 and partitioned into 4 × 4 non-overlapping patches, following the standard Swin Transformer tokenization approach. This yields (224/4) × (224/4) = 56 × 56 = 3136 visual tokens. The patches are then projected by a linear embedding layer into a feature representation with an embedding dimension of 96 channels.
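The tokenization arithmetic can be checked with a minimal NumPy sketch; the random projection below is only a stand-in for the learned linear embedding layer:

```python
import numpy as np

def patch_embed(img, patch=4, embed_dim=96, rng=None):
    """Split an H x W x C image into non-overlapping patches and project
    each flattened patch to the embedding dimension (linear-layer sketch)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = img.shape
    h, w = H // patch, W // patch                      # 224 / 4 = 56 per axis
    # (h, patch, w, patch, C) -> (h*w, patch*patch*C) flattened patch vectors
    patches = (img.reshape(h, patch, w, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(h * w, patch * patch * C))
    W_embed = rng.standard_normal((patch * patch * C, embed_dim)) * 0.02
    return patches @ W_embed                           # (3136, 96) token matrix

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (3136, 96)
```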

The encoder comprises four Swin Transformer stages composed of window-based multi-head self-attention blocks and feed-forward layers. The shifted-window attention mechanism enables the network to model both local and global contextual dependencies while maintaining linear computational complexity. Between stages, patch merging operations are applied to reduce spatial resolution and increase feature dimensionality. This hierarchical design enables the encoder to progressively capture richer semantic representations of tumor structures. The spatial resolution and channel dimensions across encoder stages are summarized in Table 5.


This hierarchical feature extraction enables the model to acquire fine tumor boundary information as well as global contextual relationships across the image.
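The between-stage patch merging, which halves the spatial resolution while doubling the channel dimension (e.g., 56 × 56 × 96 → 28 × 28 × 192 after the first stage), can be sketched as follows; the random projection is a stand-in for the learned reduction layer:

```python
import numpy as np

def patch_merge(x, rng=None):
    """Swin-style patch merging: concatenate each 2x2 neighborhood of tokens
    (giving 4C channels) and linearly project to 2C, halving H and W."""
    rng = rng or np.random.default_rng(0)
    H, W, C = x.shape
    merged = np.concatenate([x[0::2, 0::2], x[1::2, 0::2],
                             x[0::2, 1::2], x[1::2, 1::2]],
                            axis=-1)                       # (H/2, W/2, 4C)
    W_proj = rng.standard_normal((4 * C, 2 * C)) * 0.02    # learned in practice
    return merged @ W_proj                                 # (H/2, W/2, 2C)

stage1 = np.zeros((56, 56, 96))
print(patch_merge(stage1).shape)  # (28, 28, 192)
```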

3.3.2 Enhanced Hierarchical Attention Fusion (E-HAFv2)

Swin-DS-HAFUNetv2 uses an Enhanced Hierarchical Attention Fusion (E-HAFv2) module to minimize the semantic gap between encoder and decoder representations. Unlike traditional skip connections, which simply concatenate encoder and decoder feature maps, E-HAFv2 uses attention-guided fusion to highlight tumor-relevant areas and suppress background noise.

As shown in Fig. 3, encoder and decoder feature maps at the same spatial resolution are first refined by Swin Transformer blocks to strengthen their contextual features. The resulting feature maps are then concatenated, and a 1 × 1 convolution layer reduces the channel dimension. A Squeeze-and-Excitation (SE) gating strategy subsequently recalibrates channel-wise feature responses adaptively, emphasizing discriminative tumor features and suppressing redundant or less relevant information.


Figure 3: E-HAFv2 module with attention-guided skip integration and channel recalibration.

This hierarchical attention fusion improves the accuracy of tumor boundary detection, especially in structurally complex regions of the brain such as the ventricles and cortical folds.
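The SE recalibration step inside E-HAFv2 can be illustrated with a small NumPy sketch; the reduction ratio of 16 and the random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_recalibrate(feat, w1, w2):
    """Squeeze-and-Excitation: global-average-pool each channel, pass the
    descriptor through a bottleneck MLP, and rescale channels by the
    resulting [0, 1] gates."""
    squeezed = feat.mean(axis=(0, 1))                    # (C,) channel descriptor
    gates = sigmoid(np.maximum(squeezed @ w1, 0) @ w2)   # reduce -> ReLU -> expand
    return feat * gates                                  # broadcast over H, W

C, r = 64, 16                                  # assumed channels and ratio
rng = np.random.default_rng(0)
w1 = rng.standard_normal((C, C // r)) * 0.1    # reduction weights (stand-in)
w2 = rng.standard_normal((C // r, C)) * 0.1    # expansion weights (stand-in)
feat = rng.random((56, 56, C))
out = se_recalibrate(feat, w1, w2)
```

In the real module the two weight matrices are learned, so informative channels receive gates near 1 and redundant ones are attenuated.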

3.3.3 Contextual Bottleneck Transformer (CBT)

Swin-DS-HAFUNetv2 places a Contextual Bottleneck Transformer (CBT) at the deepest level of the network to capture global contextual dependencies across the entire feature map. As illustrated in Fig. 4, CBT substitutes the conventional convolutional bottleneck with a transformer-based module that incorporates:

•   Multi-Head Self-Attention (MHSA)

•   Layer Normalization

•   Feed-Forward Networks (FFN)

•   Residual connections


Figure 4: Contextual Bottleneck Transformer module used at the bottleneck of Swin-DS-HAFUNetv2.

This design allows the model to capture long-range dependencies across tumor structures that may span large spatial regions in MRI images. Additionally, CBT stabilizes feature propagation and allows the network to delineate diffuse tumor boundaries more effectively than local convolutional filters alone.
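A minimal NumPy sketch of such a pre-norm transformer block follows. The projections are deliberately simplified (queries, keys, and values share the raw per-head features), so it illustrates the MHSA + LayerNorm + FFN + residual structure listed above rather than reproducing the exact CBT implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(x, heads=8):
    """Multi-head self-attention with identity projections for brevity."""
    N, D = x.shape
    d = D // heads
    out = np.zeros_like(x)
    for h in range(heads):
        q = k = v = x[:, h * d:(h + 1) * d]      # shared projections (sketch)
        attn = softmax(q @ k.T / np.sqrt(d))     # (N, N) attention over tokens
        out[:, h * d:(h + 1) * d] = attn @ v
    return out

def cbt_block(x, w1, w2):
    """Pre-norm transformer block: MHSA and FFN, each on a residual path."""
    x = x + mhsa(layer_norm(x))
    x = x + np.maximum(layer_norm(x) @ w1, 0) @ w2   # ReLU feed-forward network
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((49, 768))          # e.g., a 7x7 bottleneck map
w1 = rng.standard_normal((768, 3072)) * 0.02     # FFN expansion (stand-in)
w2 = rng.standard_normal((3072, 768)) * 0.02     # FFN projection (stand-in)
out = cbt_block(tokens, w1, w2)
```

Because attention is computed over all bottleneck tokens, every spatial position can aggregate evidence from the whole feature map in a single step.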

3.3.4 Decoder Architecture and Adaptive Context Aggregation

The decoder gradually reconstructs high-resolution segmentation maps from the hierarchical encoder features using patch expansion and attention-fused features. Each decoder stage carries out three major operations:

1.    Patch Expansion: Patch expansion operations (i.e., transposed convolution or interpolation) progressively restore the spatial resolution of the feature maps.

2.    Attention-Guided Feature Fusion: The E-HAFv2 module is used to combine encoder features with decoder representations, which enables the decoder to take advantage of low-level spatial features and high-level semantic context.

3.    Contextual Feature Refinement: Feature maps are further refined with convolutional blocks that provide contextual enhancement.

Deep supervision is applied at several decoder levels to stabilize training and enhance gradient propagation. Auxiliary segmentation outputs (DS1, DS2, DS3) at intermediate resolutions are added to the total loss function, as shown in Fig. 5.


Figure 5: Adaptive context aggregation with multi-level deep supervision in Swin-DS-HAFUNetv2.

This multi-scale supervision strategy enhances segmentation consistency and accelerates network convergence.

3.3.5 Loss Function and Optimization

Given the class imbalance typically seen in brain tumor segmentation, a hybrid loss function is used to balance region overlap and pixel-level classification performance.

The segmentation loss is defined as in Eq. (1):

$\mathcal{L}_{seg} = \mathcal{L}_{Dice} + \mathcal{L}_{CE} + \mathcal{L}_{Focal\ Tversky}$ (1)

To add deep supervision, the overall training goal can be a combination of the losses of results of the intermediate decoder as represented in Eq. (2):

$\mathcal{L}_{total} = \mathcal{L}_{DS1} + 0.6\,\mathcal{L}_{DS2} + 0.3\,\mathcal{L}_{DS3}$ (2)

This weighted formulation guarantees that the high-resolution prediction dominates the optimization process, with intermediate outputs providing auxiliary gradient signals. The network is optimized with the Adam optimizer using a ReduceLROnPlateau learning rate schedule. Table 6 summarizes the training setup of Swin-DS-HAFUNetv2.

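The hybrid objective of Eqs. (1) and (2) can be sketched for a binary tumor mask as follows. The Focal Tversky parameters (α = 0.7, β = 0.3, γ = 0.75) are common defaults assumed here, not values reported in the paper:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - Dice overlap between prediction and mask."""
    inter = (pred * target).sum()
    return float(1.0 - (2 * inter + eps) / (pred.sum() + target.sum() + eps))

def bce_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy (the CE term for a binary mask)."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-6):
    """Focal Tversky loss: penalizes false negatives more heavily (alpha)."""
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return float((1.0 - tversky) ** gamma)

def seg_loss(pred, target):
    """Hybrid loss of Eq. (1): Dice + cross-entropy + Focal Tversky."""
    return dice_loss(pred, target) + bce_loss(pred, target) + focal_tversky_loss(pred, target)

def total_loss(preds, targets):
    """Deep-supervision objective of Eq. (2) over the DS1-DS3 outputs."""
    weights = [1.0, 0.6, 0.3]
    return sum(w * seg_loss(p, t) for w, (p, t) in zip(weights, zip(preds, targets)))
```

A perfect prediction drives all three terms toward zero, while the DS weights keep the full-resolution output dominant in the gradient.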

The segmentation output of Swin-DS-HAFUNetv2 is a key element of the proposed explainability-aware framework, as it is the main input of the downstream classification phase. Because the framework explicitly isolates tumor areas before classification, the classification models receive pure tumor regions rather than unrelated background or imaging artifacts. This tumor-centric representation not only improves classification predictions but also makes the corresponding Grad-CAM visualizations more reliable. Specifically, Swin-DS-HAFUNetv2 generates a tumor mask that constrains the spatial focus of Grad-CAM explanations, ensuring that saliency maps reflect anatomically significant parts of the tumor. Accurate tumor segmentation thus corresponds directly to the explanation-consistency filtering strategy used in the ensemble classification phase, where attention maps are evaluated within anatomically relevant areas. The resulting saliency maps align more closely with tumor boundaries and exhibit fewer false activations, enhancing the interpretability and clinical accuracy of the model's predictions.

In general, Swin-DS-HAFUNetv2 offers a powerful and contextual segmentation backbone that can be seamlessly combined with an explainability-aware classification pipeline. The architecture shows a high level of scalability in terms of feature representation, interpretability, and performance of segmentation across the MRI slices in the BRISC2025 dataset, and this indicates the methodological and clinical viability of the proposed two-stage architecture.

3.4 Brain Tumor Classification with ECWMEv2 Framework

The classification stage operates on the tumor-centered regions obtained from the Swin-DS-HAFUNetv2 segmentation output, so the classification models are trained only on the discriminative features of the tumor area rather than on background structures. The second phase of the proposed framework is devoted to the classification of the segmented brain tumors into four clinically significant categories: Glioma, Meningioma, Pituitary Tumor, and No Tumor. This phase uses a new ensemble architecture named ECWMEv2 (Explanation-Consistent Weighted Meta-Ensemble v2), which aims at maximizing predictive accuracy while providing interpretability, explanation stability, and resistance to input perturbations.

In contrast to traditional ensemble procedures that rely solely on predictive confidence, ECWMEv2 incorporates explanation-based criteria at several levels of model selection and fusion. First, a Grad-CAM stability analysis filters out models whose saliency maps are sensitive to input perturbation. Only models with stable, localized, and clinically meaningful attention patterns are retained when building the ensemble. An explanation-weighted soft-voting scheme then aggregates the filtered models, assigning higher voting weight to models with more focused and consistent explanations. Finally, a meta-learning layer based on XGBoost, trained on the model-level logit predictions, enhances the robustness and generalization of the final ensemble. The overall ECWMEv2 explainability-aware classification pipeline is illustrated in Fig. 6.

images

Figure 6: Detailed architecture of the proposed Explanation-Consistent Weighted Meta-Ensemble (ECWMEv2) classification framework.

3.4.1 Candidate Models and Feature Extraction

The ECWMEv2 ensemble uses a pool of five transformer-based and convolutional neural architectures, chosen for their variety of architectural biases and their performance on medical imaging tasks. The model pool includes CoAtNet, ConvNeXt, Swin Transformer, MaxViT, and EVA02. CoAtNet combines convolution and self-attention, balancing local spatial detail and global dependencies. ConvNeXt is a modernized CNN design that incorporates design ideas from transformers while preserving the efficiency of convolutional operations. Swin Transformer applies window-based hierarchical self-attention, enabling scalable learning of fine-grained spatial structures. MaxViT uses both block-based and grid-based attention to effectively combine local and global features. Lastly, EVA02 is a large-capacity vision transformer pretrained with masked image modeling, which offers strong transfer learning in low-data medical imaging. All models are initialized with ImageNet-1K pretrained weights and fine-tuned on tumor-focused regions extracted from the BRISC2025 dataset using the output of the segmentation module. This guarantees both localized representations and diagnostic relevance. Table 7 gives a detailed description of the model specifications and training configurations.

images

3.4.2 Grad-CAM Consistency Filtering for Explanation Reliability

Grad-CAM-based explanation consistency is used to assess whether candidate models have stable and clinically meaningful attention patterns, and only models that pass this test are included in the ensemble. Models that satisfy pre-determined consistency criteria are added to the final ensemble, and the other models are eliminated. The explanation consistency evaluation procedure, including similarity metrics, threshold criteria, and empirical justification, is detailed in Section 3.5.

3.4.3 Explanation-Weighted Ensemble Construction

Following the explanation-consistency filtering, the retained models are not treated equally. Instead, their contributions to the final prediction are weighted according to the degree of Grad-CAM stability observed during the filtering stage.

3.4.4 Meta-Ensemble Layer Using XGBoost

To further refine the decision-making process, ECWMEv2 uses a meta-classifier trained on the logits of the retained models, allowing a refined prediction through model-level feature combination. The raw logit vectors of each backbone are concatenated and provided as features to an XGBoost classifier. The meta-layer learns non-linear relationships between the predictions of the base models and can adaptively resolve conflicts between them. It is trained on a stratified validation split with a binary logistic loss and L2 regularization. Such two-level fusion, explanation-filtered soft voting followed by meta-classification, allows the framework to achieve high generalizability without loss of explainability.
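The meta-feature construction described above can be sketched as follows. This is an illustrative numpy sketch of the stacking step only; the function name is an assumption, and the XGBoost fitting call is indicated in a comment rather than executed.

```python
import numpy as np

def stack_logits(per_model_logits):
    """Concatenate per-model logit vectors into meta-features.

    per_model_logits: list of (n_samples, n_classes) arrays, one per
    retained backbone. Returns (n_samples, n_models * n_classes)."""
    return np.concatenate(per_model_logits, axis=1)

# Example: three retained backbones, 8 validation samples, 4 classes.
n_samples, n_classes = 8, 4
logits = [np.random.randn(n_samples, n_classes) for _ in range(3)]
meta_X = stack_logits(logits)
# meta_X would then be passed to the XGBoost meta-classifier, e.g.
# XGBClassifier(max_depth=3, learning_rate=0.05, reg_lambda=1.0),
# trained on the stratified validation split as described above.
```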

3.4.5 Training and Optimization Details

The backbone models are optimized with uniform hyperparameters and data augmentation schedules. The output layers of each model are retrained for multiclass classification using the cross-entropy loss. Training is performed on NVIDIA A100 GPUs with PyTorch and mixed precision to speed up convergence. The XGBoost meta-classifier is trained on the validation split with a maximum depth of 3, a learning rate of 0.05, and an L2 regularization weight of 1.0. Table 8 lists the training setup for each model in the ECWMEv2 pool.

images

To reduce the risk of overfitting and improve model generalization, several strategies were employed during training. First, transfer learning was applied by initializing backbone networks with ImageNet-1K pretrained weights. Second, extensive data augmentation was applied using rotation, flipping, blur, and noise injection. Third, early stopping with a patience of five epochs was used to avoid excessive training once validation performance stopped improving. Furthermore, to guarantee robustness and minimize sampling bias, five-fold cross-validation was used in the experimental assessment: the dataset was split into five folds, with the model trained and evaluated on alternate splits. For the segmentation network, deep supervision was added to enhance gradient propagation and minimize overfitting at intermediate feature layers.
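The early-stopping strategy with a patience of five epochs can be sketched as a small stateful helper. The class name and interface are illustrative assumptions, not the paper's implementation.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without validation improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Record one epoch's validation metric; return True to stop."""
        if val_metric > self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step()` would be called once per epoch with the validation accuracy or Dice score, and the loop breaks when it returns True.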

3.5 Explainability Analysis and Consistency Evaluation Using Grad-CAM

3.5.1 Explainability Integration within the Segmentation–Classification Pipeline

The explainability aspect of the proposed framework is achieved using Grad-CAM as a post hoc interpretability approach, rather than relying directly on internal transformer attention maps. Although transformer self-attention captures implicit feature interactions, attention weights are not always spatially interpretable as saliency regions in medical images. Grad-CAM is therefore used to produce spatially localized heatmaps showing which tumor areas drive the classification predictions.

In contrast to traditional approaches, where Grad-CAM is applied to visualize the model only after inference, the proposed framework incorporates explainability into model selection and ensemble construction. In particular, Grad-CAM maps obtained from a candidate model are checked under controlled input perturbations, and models that give unstable or inconsistent explanations are excluded from the ensemble. Explainability thus serves as an explicit reliability metric for filtering and weighting models, making interpretability an integral part of the decision-making pipeline rather than a purely descriptive post hoc visualization. Saliency-based attribution with Grad-CAM is employed to create spatially consistent attention maps appropriate for interpreting brain tumor classification. Unlike SHAP or LIME, which provide feature-level attribution, Grad-CAM offers anatomically aligned heatmaps, which are required in medical imaging tasks where regional continuity is important.

For a given class c, Grad-CAM computes the importance weight of feature map k in the final convolutional layer as in Eq. (3):

$\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^k}$ (3)

Here, $y^c$ is the score for class c, $A^k$ is the k-th feature map, and Z is the spatial normalization factor. The Grad-CAM heatmap is then computed as in Eq. (4):

$G^c = \mathrm{ReLU}\left(\sum_{k}\alpha_k^c A^k\right)$ (4)
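Eqs. (3) and (4) can be sketched directly in numpy, given feature maps and their gradients with respect to the class score. This is an illustrative sketch; in practice the gradients would come from a backward pass through the trained network.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap from feature maps and gradients.

    feature_maps, gradients: arrays of shape (K, H, W), where
    gradients[k] approximates d y_c / d A^k at each spatial location."""
    k, h, w = feature_maps.shape
    # Eq. (3): spatially averaged gradients, with Z = H * W.
    alpha = gradients.reshape(k, -1).mean(axis=1)
    # Eq. (4): weighted sum over channels, clipped by ReLU.
    heatmap = np.einsum("k,khw->hw", alpha, feature_maps)
    return np.maximum(heatmap, 0.0)
```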

Eq. (4) produces a spatial saliency map indicating the image regions contributing to the prediction. Each model in the classification ensemble produces Grad-CAM maps for clean and perturbed MRI inputs. Perturbations comprise rotation, blur, and noise injection, simulating clinically relevant input variations. To measure explanation stability, two measures are employed for each pair of maps (clean $G_c$ and perturbed $G_p$):

•   Intersection over Union (IoU) is a measure of spatial overlap:

$\mathrm{IoU}(G_c, G_p) = \frac{|G_c \cap G_p|}{|G_c \cup G_p|} \ge 0.6$ (5)

•   Cosine Similarity is a measure of the directional consistency of attention distributions:

$\mathrm{Cosine}(G_c, G_p) = \frac{G_c \cdot G_p}{\lVert G_c \rVert \, \lVert G_p \rVert} \ge 0.6$ (6)

Models that satisfy both thresholds (IoU ≥ 0.6, Cosine Similarity ≥ 0.6) in at least 80% of the test samples are said to be explanation-consistent. The pass rate is calculated as in Eq. (7). This forms the Explanation Reliability Layer (ERL) in the ECWMEv2 framework, which filters out unstable models before ensemble construction.

$\mathrm{Pass\ Rate} = \frac{\mathrm{Samples\ with\ Stable\ CAM}}{\mathrm{Total\ Test\ Samples}} \times 100\%$ (7)
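The consistency check of Eqs. (5)-(7) can be sketched as follows. This is a minimal numpy illustration: the binarization threshold of 0.5 for the IoU computation is an assumption, and function names are illustrative.

```python
import numpy as np

def cam_iou(gc, gp):
    # Eq. (5): spatial overlap of binarized CAM masks.
    inter = np.logical_and(gc, gp).sum()
    union = np.logical_or(gc, gp).sum()
    return inter / union if union else 1.0

def cam_cosine(gc, gp):
    # Eq. (6): directional consistency of the raw attention maps.
    num = (gc * gp).sum()
    den = np.linalg.norm(gc) * np.linalg.norm(gp)
    return num / den if den else 0.0

def pass_rate(clean_maps, perturbed_maps, thr=0.6):
    # Eq. (7): percentage of samples whose clean/perturbed CAM pair
    # satisfies both thresholds.
    stable = sum(
        cam_iou(c > 0.5, p > 0.5) >= thr and cam_cosine(c, p) >= thr
        for c, p in zip(clean_maps, perturbed_maps)
    )
    return 100.0 * stable / len(clean_maps)
```

A model would be retained when `pass_rate(...) >= 80`, matching the ERL criterion above.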

To ensure fairness in model weighting, the retained models are not treated equally. Instead, each model is assigned a weight $w_m$ proportional to its explanation consistency, calculated in Eq. (8):

$w_m = \frac{1}{2}\left(\frac{\mathrm{AvgIoU}_m}{\sum_{j}\mathrm{AvgIoU}_j} + \frac{\mathrm{AvgCos}_m}{\sum_{j}\mathrm{AvgCos}_j}\right)$ (8)

The final class probability is obtained via explanation-weighted soft voting, expressed in Eq. (9). Here, Z ensures normalization and $p_m(c \mid x)$ is the softmax probability from model m for class c.

$P_{\mathrm{ECWMEv2}}(c \mid x) = \frac{1}{Z}\sum_{m \in M_{\mathrm{selected}}} w_m \, p_m(c \mid x)$ (9)
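Eqs. (8) and (9) can be sketched together in a few lines of numpy. This is an illustrative sketch; the per-model average IoU and cosine scores would come from the perturbation analysis described above.

```python
import numpy as np

def explanation_weights(avg_iou, avg_cos):
    # Eq. (8): each model's weight averages its normalized share of the
    # total Grad-CAM IoU and cosine scores over the retained pool.
    iou = np.asarray(avg_iou, dtype=float)
    cos = np.asarray(avg_cos, dtype=float)
    return 0.5 * (iou / iou.sum() + cos / cos.sum())

def weighted_soft_vote(probs, weights):
    """Eq. (9): weighted, renormalized mean of softmax outputs.

    probs: (n_models, n_classes) softmax outputs for one sample."""
    combined = (weights[:, None] * probs).sum(axis=0)
    return combined / combined.sum()  # division by Z ensures normalization
```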

3.5.2 Threshold Selection and Trade-Offs

Thresholds of IoU ≥ 0.6 and Cosine Similarity ≥ 0.6 with a pass rate ≥ 80% were determined empirically to balance strictness against ensemble diversity. Table 9 demonstrates the impact of alternative thresholds: overly conservative thresholds (0.8 or higher) eliminate even stable models, whereas overly lenient thresholds eliminate none. The aim is not to impose identical attention maps but to enforce consistency that is both anatomically significant and clinically interpretable.

images

3.5.3 Results of Consistency Filtering

When the chosen thresholds were applied to all models, only ConvNeXt, Swin Transformer, and EVA02 produced consistently stable, well-localized Grad-CAM heatmaps over tumor regions under perturbation. Conversely, the explanations of CoAtNet and MaxViT were diffuse or unstable, and both models were eliminated from the final ensemble. The high explanation consistency, reflected in an average Grad-CAM IoU of 0.692, supports the capability and transparency of the retained models. Table 10 provides a quantitative reliability analysis for each model, including its explanation reliability and retention status.

images

This analysis establishes Grad-CAM not merely as a visualization tool, but as an integral ensemble component for filtering, weighting, and assessing model reliability. By ensuring that only explanation-consistent models contribute to the predictions, ECWMEv2 improves interpretability, robustness, and clinical credibility. The proposed explanation-consistency filtering method is described in Algorithm 1.

images

3.6 Performance Evaluation Metrics

The proposed framework is evaluated at three stages: tumor segmentation, tumor classification, and explanation consistency (discussed in Section 3.5). All experiments use the test split of the BRISC2025 dataset according to Table 2 and are performed with subject-level separation to avoid data leakage between folds.

3.6.1 Segmentation Evaluation

The segmentation model (Swin-DS-HAFUNetv2) is evaluated in terms of standard spatial overlap measures. The Dice coefficient measures the overlap between the predicted segmentation P and the ground-truth mask G, as given in Eq. (10):

$\mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|}$ (10)

The Intersection over Union (Jaccard Index) is computed using Eq. (11):

$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$ (11)

To further analyze the model's discriminative ability within relevant areas, precision and recall are calculated using Eqs. (12) and (13):

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (12)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (13)

Here, TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
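The segmentation metrics of Eqs. (10)-(13) can be computed directly from binary masks, as in this minimal numpy sketch (function name is illustrative):

```python
import numpy as np

def segmentation_metrics(p, g):
    """Compute Dice, IoU, precision, and recall for binary masks p and g."""
    p, g = p.astype(bool), g.astype(bool)
    tp = np.logical_and(p, g).sum()   # pixels correctly marked as tumor
    fp = np.logical_and(p, ~g).sum()  # background marked as tumor
    fn = np.logical_and(~p, g).sum()  # tumor pixels missed
    return {
        "dice": 2 * tp / (p.sum() + g.sum()),   # Eq. (10)
        "iou": tp / np.logical_or(p, g).sum(),  # Eq. (11)
        "precision": tp / (tp + fp),            # Eq. (12)
        "recall": tp / (tp + fn),               # Eq. (13)
    }
```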

3.6.2 Classification Evaluation

The classification module is evaluated using overall accuracy and class-balanced metrics. Overall classification accuracy is given in Eq. (14):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (14)

To account for class imbalance, the F1-score is calculated as the harmonic mean of precision and recall using Eq. (15):

$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (15)
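Eqs. (14)-(15) extend to the four-class setting via macro-averaging of per-class F1 scores, as reported in the results. A minimal pure-Python sketch (function names are illustrative):

```python
def accuracy(y_true, y_pred):
    # Eq. (14): fraction of correct predictions over all samples.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    # Per-class F1 via Eq. (15), averaged uniformly over classes.
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```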

4  Experimental Results and Analysis

This section discusses in detail the performance of the proposed explainability-integrated framework, which consists of Swin-DS-HAFUNetv2 for tumor segmentation and ECWMEv2 for classification. The assessment covers qualitative and quantitative segmentation evaluation, ablation studies, explanation-consistency evaluation, and final ensemble performance. Together, the results demonstrate the adequacy, clinical significance, and interpretability of the proposed framework.

4.1 Segmentation Results

4.1.1 Qualitative Analysis of Segmentation Performance

To obtain a more in-depth understanding of the Swin-DS-HAFUNetv2 model, qualitative segmentation results on the validation set were inspected and clustered into good (Dice ≥ 0.90), borderline (0.75-0.85), and bad (≤ 0.70) cases, as illustrated in Fig. 7. This categorization reveals the variability of segmentation performance and highlights recurring patterns of clinical relevance.

images images

Figure 7: Qualitative investigation of segmentation outcomes from the proposed Swin-DS-HAFUNetv2 model. (a) Good examples (High Dice ≥ 0.90): predicted masks closely match the ground truth, with few FP and FN; (b) Borderline examples (Medium Dice 0.75-0.85): predicted masks are close to the ground truth, with moderate FP and FN; (c) Bad examples (Low Dice ≤ 0.70): significant segmentation failure, such as a missed tumor.

4.1.2 Training Convergence and Stability across Folds

Mean IoU (mIoU) per epoch was used to monitor training under four-fold cross-validation of the segmentation model. As shown in Fig. 8, the model converged steadily across folds, reaching optimal performance at epoch 25, after which the output stabilized. The mean mIoU was 0.865 ± 0.16, validating the architecture's consistency across validation splits and the efficiency of deep supervision and attention integration.

images

Figure 8: mIoU vs. Epochs for all folds.

4.1.3 Quantitative Results of Cross-Validation and Statistical Testing

Table 11 presents the segmentation results across cross-validation folds. The outcomes show consistent performance, with Dice scores between 0.9780 and 0.9785. The model performed reliably across all folds, with an average precision of 0.9363 and recall values from 0.9040 to 0.9091.

images

Further statistical reliability testing of Swin-DS-HAFUNetv2 was performed through confidence-interval analysis of the cross-validation results. The mean Dice score was 0.9782 with a standard deviation of 0.0003, giving a 95% confidence interval of 0.9779-0.9785. This narrow range indicates that segmentation performance is nearly identical across validation splits, confirming that the performance gains are statistically consistent rather than incidental.

4.1.4 Ablation Study of the Segmentation Pipeline

To quantify the individual contributions of the E-HAF and E-CBE modules, a systematic ablation study was conducted, as summarized in Table 12. The baseline Swin-DS-HAFUNetv2 model without either component achieved a weighted mIoU of 85.2%. Using E-HAF in isolation enhanced multi-scale feature integration, yielding consistent improvement across all tumor types and raising the weighted mIoU to 86.5%. Adding E-CBE further improved global contextual modeling at the bottleneck, increasing the weighted mIoU to 87.1%, with significant gains for glioma and meningioma cases. Overall performance was greatest when both E-HAF and E-CBE were used, with the glioma group showing the largest increase. These findings indicate that E-HAF and E-CBE are complementary: E-HAF strengthens multi-scale feature fusion, E-CBE enriches the bottleneck representation, and together they deliver substantial mIoU improvements in tumor boundary demarcation and contextual understanding, especially for the more complex and heterogeneous glioma subtype.

images

4.1.5 Comparative Evaluation with Existing Models

The proposed Swin-DS-HAFUNetv2 model was tested and compared to some of the current segmentation models. Recent transformer-based segmentation models such as TransUNet, Swin-UNet, and UNETR have demonstrated strong performance in medical image segmentation tasks [66]. However, many of these architectures are primarily designed for volumetric multimodal MRI datasets such as BraTS. Since the BRISC2025 dataset contains single-slice T1-weighted images, direct comparison with these models requires substantial architectural adaptation. As shown in Table 13, the proposed method significantly outperforms existing studies, particularly in glioma, meningioma, and pituitary cases.

images

4.2 Classification Results

4.2.1 Single-Model Performance

Each backbone model was trained on the classification task. The results in Table 14 indicate that Swin Transformer and ConvNeXt outperformed the rest in terms of both predictive and explanation consistency.

images

4.2.2 Grad-CAM Explanation Consistency Analysis—Perturbation Stability

Explanation consistency was assessed under noise, rotation, and contrast jitter. Swin, ConvNeXt, and EVA02 passed the minimum threshold (IoU ≥ 0.6, cosine similarity ≥ 0.6, pass rate ≥ 0.8), as shown in Tables 10 and 15. Qualitative overlays of Grad-CAM on tumor regions of a sample image in Fig. 9 demonstrate moderate overlap (Dice = 0.587, IoU = 0.416), confirming alignment between model attention and pathology.

images

images

Figure 9: Visual and quantitative assessment of Grad-CAM alignment with tumor pathology, showing overlap between attention maps and ground-truth tumor regions under clean inputs: (a) original input MRI slice, (b) resized binary tumor ground truth mask, (c) tumor mask overlaid in red, (d) Grad-CAM heatmap contour overlaid on the MRI and tumor mask, and (e) full Grad-CAM heatmap with IoU and Dice values. The moderate overlap (IoU = 0.416, Dice = 0.587) indicates partial alignment between model attention and tumor region, validating the clinical relevance of explanations.

Fig. 10 shows Grad-CAM visualizations under varying input perturbations, clearly revealing which models produce stable explanations. Swin Transformer and ConvNeXt consistently produce well-localized, strong attention maps that remain unchanged under brightness, noise, and rotation perturbations, indicating a consistent focus on tumor regions. EVA02 shows relatively wider and noisier activations, but these remain semantically meaningful and provide valuable global contextual information. CoAtNet, in contrast, exhibits fragmented and dispersed attention that is highly sensitive to perturbations, whereas MaxViT generates weak or diffuse activations of low anatomical relevance. These findings show that only Swin Transformer, ConvNeXt, and EVA02 meet the quantitative explanation-consistency criteria (IoU ≥ 0.60, cosine similarity ≥ 0.60, and pass rate ≥ 0.80) and are thus included in the final ensemble.

images

Figure 10: Grad-CAM visualizations under various input perturbations for different models.

4.2.3 Explanation-Consistent Meta-Ensemble (ECWMEv2)

The proposed ECWMEv2 framework was compared with explanation-weighted soft voting and single classifiers. As shown in Table 16, ECWMEv2 employing an XGBoost meta-learner achieved superior class-balanced performance, attaining a macro-F1 score of 0.9867 with an accuracy of 0.9917. In comparison, explanation-weighted soft voting achieved lower accuracy (0.9863) and lower macro-F1 (0.9734), indicating reduced robustness in handling inter-class variability.

images

To verify that the performance improvement achieved by ECWMEv2 is statistically meaningful, significance testing was conducted on the five-fold cross-validation results. The proposed ECWMEv2 ensemble achieved an average accuracy of 0.9917 ± 0.0046, compared with 0.9863 ± 0.0031 for explanation-weighted soft voting; the Macro-F1 score similarly improved from 0.9734 ± 0.0062 to 0.9885 ± 0.0075. The stability of ECWMEv2 is further confirmed by the five-fold cross-validation results in Table 17, with accuracy ranging from 0.9897 to 1.0000 and F1 scores from 0.9789 to 1.0000. The fold means (accuracy = 0.9917, macro-F1 = 0.9867) indicate high generalization and low variance, showing that explanation-consistency-based model selection contributes to consistently high classification accuracy.

images

4.2.4 Ablation Study of the Classification Pipeline

Table 18 summarizes the effect of each component in the ECWMEv2 classification framework. Using the best-performing single model (Swin Transformer) as a baseline, we observed an accuracy of 98.71% and a macro-F1 of 98.12%, with high Grad-CAM localization consistency (Mean IoU = 0.722). When the Explanation Reliability Layer (ERL) was removed, and all models were included without Grad-CAM filtering, performance dropped to 98.42% accuracy and 97.81% macro-F1, with a significantly lower explanation quality (Mean IoU = 0.461) due to the inclusion of unstable models like CoAtNet and MaxViT.

images

Applying uniform rather than explanation-weighted voting to the filtered models resulted in 98.83% accuracy and 98.29% macro-F1, indicating that explanation-weighted fusion enhances calibration and class balance. The full ECWMEv2 architecture, combining Grad-CAM consistency filtering (ERL) with explanation-weighted soft voting, performed best, with a final accuracy of 99.17%, a macro-F1 of 98.67%, and a Mean IoU of 0.692. These results confirm that filtering out models with inconsistent explanations and focusing the ensemble fusion on reliable ones increases both prediction reliability and interpretability in brain tumor classification.

Besides predictive performance, the computational complexity of the proposed framework was evaluated based on model parameters, training time, and inference latency.

4.3 Computational Complexity of Proposed Framework

Table 19 provides the computational requirements of the proposed framework. Swin-DS-HAFUNetv2, the segmentation backbone, comprises around 63 million parameters and takes 2.5-3 h to train per fold, reflecting the computational demands of a transformer-based architecture. The classification backbones (Swin, ConvNeXt, and EVA02) have 88 to 354 M parameters and can each be trained in 1.5-2 h. These training costs indicate that model development is computationally expensive, making centralized GPU computing environments the most suitable setting.

images

Although training cost remains relatively high, inference is achieved with minimal latency, which is essential for practical deployment. The segmentation module takes about 35 ms/image, and the classification models take 25-40 ms/image. The XGBoost-based ensemble module adds almost no overhead, taking less than one minute to train and about 3 ms per prediction.

Overall, the end-to-end pipeline processes an image in about 70-90 ms, indicating that segmentation, classification, and explainability-aware ensemble learning do not incur prohibitive inference costs. These findings suggest that, although the framework is not optimized for ultra-low-resource or edge devices, it can be used in offline or near-real-time clinical research workflows, where robustness and interpretability are prioritized over a minimal computational footprint.

4.4 Comparative Analysis with Existing Studies

Table 20 presents a quantitative comparison of the proposed framework with recent research on brain tumor segmentation and classification. Existing hybrid segmentation-classification models based on convolutional architectures typically obtain segmentation Dice scores of about 0.91 and classification accuracy of 98%, and are therefore limited in detecting complex tumor margins and overall contextual information [64]. Transfer-learning-based classification methods achieve predictive accuracy of up to 98.49%; however, they address only classification and apply no explicit tumor localization or explanation-sensitive evaluation [56]. Hybrid CNN-Transformer classifiers enrich feature representation and report classification accuracy in the range of 96%-98%, but do not incorporate a segmentation stage or quantitatively evaluate explanation stability [25]. Multi-class MRI tumor classification with ensemble CNN models, including InceptionV3 + Xception, reaches a validation accuracy of about 98.30%, yet supports neither segmentation nor interpretability, both of which are essential in practice [58]. Transformer-based approaches evaluated on the BRISC dataset offer stronger baselines, with segmentation Dice scores of 0.94-0.95 and classification accuracy of roughly 98%, but explainability in these works is treated largely as a post hoc visualization tool and is not considered in model selection or ensemble building [20]. Comparative studies of CNN and hybrid Transformer models validate the effectiveness of Transformer-based classifiers with accuracy of 98.2%, but are restricted to classification-only pipelines [59]. Explainable deep learning frameworks using NASNet with Grad-CAM and LIME achieve about 92.98% classification accuracy, improving transparency, but remain limited to binary classification without segmentation or quantitative explainability evaluation [42].
The proposed Swin-DS-HAFUNetv2 + ECWMEv2 framework instead integrates transformer-based segmentation, explanation-consistent ensemble classification, and quantitative explainability validation into one pipeline. Hybrid explainable models that combine SHAP or Grad-CAM achieve classification accuracy between 97% and 98%; however, explanation robustness to perturbations is not considered, and interpretability does not affect ensemble decisions [35]. The segmentation stage achieves a mean Dice score of 0.9782, outperforming current CNN- and Transformer-based segmentation approaches. For classification, the ECWMEv2 ensemble with an XGBoost meta-learner attains an average five-fold accuracy of 0.9917 and a macro-F1 score of 0.9867, surpassing both individual classifiers and explanation-weighted voting alone. Notably, explanation consistency is enforced through perturbation-based Grad-CAM analysis, yielding a Grad-CAM IoU of 0.692 and ensuring that the model's attention remains stable and clinically significant.

images

5  Conclusions and Future Directions

This research introduces a unified explainability-aware system for automated brain tumor analysis that combines transformer-based segmentation with explanation-consistent ensemble classification. The proposed Swin-DS-HAFUNetv2 achieved highly accurate localization of tumor boundaries by combining hierarchical self-attention, contextual bottleneck modeling, and deep supervision, leading to accurate segmentation of various tumor types. By coupling segmentation with downstream classification, the framework minimized background bias and grounded predictive decisions in clinically relevant tumor regions rather than spurious image characteristics. In addition, the ECWMEv2 classification framework adopted explanation consistency as a core principle of ensemble construction. The framework attained 99.17% classification accuracy, remained stable under various perturbations, and imposed a reliability metric on model-generated explanations during selection and weighting, producing a stable and interpretable decision-making process. In contrast to the traditional approach, where explainability is regarded as post hoc visualization, this research incorporates interpretability into the architectural design and optimization of the ensemble framework, increasing robustness, transparency, and clinical trustworthiness. The proposed framework demonstrates that predictive performance can be achieved jointly with explanation stability, offering a promising path toward reliable and understandable artificial intelligence systems that perform at benchmark scale and can serve as a general methodology for future clinical-scale studies.

The proposed explainability-aware framework presents a number of practical benefits for automated brain tumor analysis. It combines transformer-based segmentation with explanation-consistent ensemble classification, allowing precise tumor localization and type prediction within a single pipeline. Using Grad-CAM stability as a model selection criterion ensures that classification decisions rest on coherent and anatomically significant visual explanations, enhancing the transparency of decision-making. Moreover, performing segmentation before classification allows the models to focus on tumor regions rather than background tissue, minimizing spurious activations and enhancing diagnostic reliability. These features make the framework appropriate for research-oriented clinical decision support systems, in which interpretability and prediction stability are critical.

5.1 Research Limitations

Although the proposed explainability-aware framework demonstrates promising performance, the following limitations should be addressed in future work. The experiments were initially conducted on 2D single-slice T1-weighted MRI images, as the BRISC2025 dataset only provides this type of data. Consequently, the framework currently does not leverage multimodal MRI data, such as T2 or FLAIR images, which are widely used in clinical neuro-oncology to further characterize tumors. Additionally, the proposed method is based on 2D MRI slices rather than 3D volumetric scans, and therefore lacks the ability to capture inter-slice contextual information—an important factor in achieving precise clinical diagnoses.

The experimental evaluation was performed using only one benchmark dataset, with no external validation conducted on other publicly available datasets such as BraTS, TCIA, or REMBRANDT. This limitation may hinder the generalization of the proposed framework to datasets with different imaging characteristics or tumor patterns.

Moreover, multi-center validation involving various scanners, institutions, or acquisition protocols was not carried out. Variations in MRI hardware, acquisition parameters, and imaging protocols in real-world clinical settings can lead to domain shifts, potentially degrading the performance of models trained on a specific dataset when applied to different clinical environments.

In addition, potential biases may exist in the BRISC2025 dataset due to the distribution of tumor types, imaging orientations, or patient demographics. If certain tumor features or imaging characteristics are over-represented, the trained models may learn dataset-specific patterns rather than generalizable diagnostic features. Although cross-validation and data augmentation were employed to mitigate overfitting, further testing on diverse clinical data is necessary to ensure fairness and robustness across different patient populations.

Furthermore, the proposed framework is built upon transformer-based architectures and ensemble learning, which entail considerable computational complexity and require a GPU-enabled training environment. While inference latency remains manageable, the training demands may limit deployment in resource-constrained settings, such as edge-based medical systems.

Finally, although Grad-CAM explanations were used to assess explanation stability and select the optimal ensemble, saliency-based explanation methods can be sensitive to model architecture and input perturbations. Complementary interpretability techniques and uncertainty-aware explanations should therefore be explored in future work to further assess potential biases and improve the reliability of visual explanations in clinical decision-support systems.
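The sensitivity noted above can be quantified directly. A minimal sketch, assuming a simple mean-Pearson-correlation stability measure (the paper's own perturbation protocol may differ), scores how much a saliency map changes when the input is perturbed with small Gaussian noise; `toy_saliency` below is a hypothetical stand-in for a real Grad-CAM call:

```python
import numpy as np

def explanation_stability(saliency_fn, image, n_trials=5, noise_std=0.05, seed=0):
    """Mean Pearson correlation between the saliency map of `image` and the
    saliency maps of Gaussian-perturbed copies (1.0 = perfectly stable)."""
    rng = np.random.default_rng(seed)
    base = saliency_fn(image).ravel()
    scores = []
    for _ in range(n_trials):
        noisy = image + rng.normal(0.0, noise_std, size=image.shape)
        scores.append(np.corrcoef(base, saliency_fn(noisy).ravel())[0, 1])
    return float(np.mean(scores))

# Hypothetical stand-in for Grad-CAM: saliency proportional to intensity
toy_saliency = lambda img: np.abs(img)
img = np.random.default_rng(1).random((8, 8))
score = explanation_stability(toy_saliency, img)
print(round(score, 3))  # near 1.0 for this robust toy explainer
```

A fragile explainer would score markedly lower under the same perturbations, which is exactly the failure mode that motivates using stability as an ensemble admission criterion.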

5.2 Future Research Directions

This research can be further advanced in the following directions. First, the proposed framework can be extended to accommodate multimodal and volumetric 3D MRI data, including T1, T2, and FLAIR modalities, to improve tumor characterization and localization. Second, incorporating uncertainty-aware learning could provide a more objective estimate of prediction confidence and help address ambiguous or borderline cases in clinical imaging. Third, model compression techniques such as pruning, knowledge distillation, or quantization can be employed to enhance computational efficiency, facilitating the integration of the system into clinical practice. Finally, multicenter validation studies using a diverse range of MRI data will be necessary to evaluate the robustness and clinical utility of the proposed explainability-aware system.
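The uncertainty-aware direction mentioned above can be made concrete with predictive entropy over repeated stochastic forward passes (e.g., Monte Carlo dropout). The sketch below is illustrative only, with simulated softmax outputs standing in for real network passes, and assumes the standard entropy-of-the-mean formulation:

```python
import numpy as np

def predictive_entropy(prob_samples):
    """Entropy of the mean class distribution over T stochastic forward
    passes; higher entropy indicates a less confident prediction.

    prob_samples: shape (T, n_classes), each row a softmax output.
    """
    mean_p = np.asarray(prob_samples, dtype=float).mean(axis=0)
    return float(-(mean_p * np.log(mean_p + 1e-12)).sum())

# Simulated passes: a confident case vs. an ambiguous borderline case
confident = [[0.97, 0.01, 0.01, 0.01]] * 10
ambiguous = [[0.55, 0.45, 0.0, 0.0], [0.45, 0.55, 0.0, 0.0]] * 5
print(predictive_entropy(confident) < predictive_entropy(ambiguous))  # True
```

In a deployed system, cases whose entropy exceeds a calibrated threshold could be flagged for radiologist review rather than classified automatically.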

Acknowledgement: We would like to express our sincere gratitude to all individuals.

Funding Statement: The authors received no specific funding for this study.

Author Contributions: The authors confirm contribution to the paper as follows: study conception and design: Mamoona Jabbar, Uzma Jamil; data collection: Mamoona Jabbar; analysis and interpretation of results: Mamoona Jabbar, Uzma Jamil; draft manuscript preparation: Mamoona Jabbar, Bushra Zafar; review and editing: Mamoona Jabbar, Muhammad Younas; funding acquisition: Mamoona Jabbar. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The BRISC dataset is available at Kaggle (https://www.kaggle.com/datasets/briscdataset/brisc2025/), Figshare (https://doi.org/10.6084/m9.figshare.30533120), and Zenodo (https://doi.org/10.5281/zenodo.17524350).

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.