Explainable Segmentation-Guided Mamba-Transformer Framework for Automated Cardiovascular Disease Detection

Ghada Atteia; Abdulaziz Altamimi; Nihal Abuzinadah; Khaled Alnowaiser; Muhammad Umer; Yunyoung Nam; Yongwon Cho

doi:10.32604/cmes.2026.078510

icon Open Access

ARTICLE

Explainable Segmentation-Guided Mamba-Transformer Framework for Automated Cardiovascular Disease Detection

Ghada Atteia¹, Abdulaziz Altamimi², Nihal Abuzinadah³, Khaled Alnowaiser⁴, Muhammad Umer^5,*, Yunyoung Nam⁶, Yongwon Cho^6,*

1 Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
2 Department of Computer Science and Engineering, University of Hafr Al-Batin, Hafar Al-Batin, Saudi Arabia
3 Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
4 Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia
5 Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
6 Department of Computer Science and Engineering, Soonchunhyang University, Asan, Republic of Korea

* Corresponding Authors: Muhammad Umer. Email: email ; Yongwon Cho. Email: email

(This article belongs to the Special Issue: Exploring the Impact of Artificial Intelligence on Healthcare: Insights into Data Management, Integration, and Ethical Considerations)

Computer Modeling in Engineering & Sciences 2026, 147(1), 43 https://doi.org/10.32604/cmes.2026.078510

Received 01 January 2026; Accepted 03 April 2026; Issue published 27 April 2026

Abstract

Cardiovascular diseases (CVD) remain the leading cause of global mortality, making early and accurate diagnosis essential for improving patient outcomes. However, most existing deep learning approaches address cardiac image segmentation or disease classification independently, limiting their effectiveness in complex clinical decision-making scenarios. In this study, we propose an explainable spatio-temporal deep learning framework that integrates segmentation-guided representation learning with efficient temporal modeling for automated CVD detection. The proposed architecture incorporates the Segment Anything Model for Medical Imaging in 2D (SAM-Med2D) to achieve accurate cardiac structure segmentation, followed by Mamba-based temporal feature extraction and Transformer-driven spatial representation learning to capture both dynamic motion patterns and anatomical dependencies in cardiac imaging sequences. To enhance transparency and clinical trust, Gradient-weighted Class Activation Mapping (Grad-CAM) and SHapley Additive exPlanations (SHAP) are employed to provide interpretable diagnostic insights. The framework is evaluated on three benchmark cardiovascular datasets, including EchoNet-Dynamic, CAMUS echocardiography, and UK Biobank cine cardiac magnetic resonance imaging (CMR). Experimental results demonstrate strong performance, achieving a Dice score of 91.20% for segmentation, an AUC of 95.50%, classification accuracy of 92.10%, and an MCC of 0.84, consistently outperforming multiple baseline methods. The proposed framework consistently outperforms baseline and existing methods, achieving approximately 3%–6% improvement in segmentation performance and 3%–4% improvement in classification accuracy across key evaluation metrics. The proposed approach offers a robust and explainable solution for automated cardiovascular disease detection, with significant potential to support reliable clinical deployment and improve diagnostic workflows in medical imaging practice.

Keywords

Medical imaging; explainable artificial intelligence; transformer; segmentation; cardiovascular disease detection

1 Introduction

CVDs represent one of the most prevalent and life-threatening health burdens worldwide. According to the World Health Organization (WHO), CVDs are the leading cause of death globally, accounting for approximately 17.9 million deaths each year, which represents nearly 32% of all global deaths. The increasing incidence of conditions such as coronary heart disease, heart failure, and cardiomyopathies highlights the urgent need for early and reliable diagnostic support. Early detection and timely intervention can significantly reduce mortality and improve patient outcomes, emphasizing the importance of developing accurate and automated diagnostic systems for cardiovascular disease analysis. Despite significant progress in medical imaging technologies such as echocardiography and cardiac magnetic resonance imaging (CMR), accurate interpretation of dynamic cardiac structures remains challenging due to patient-specific variability, heterogeneous anatomical patterns, and differences in acquisition protocols across clinical centers [1]. Traditional computer-aided diagnostic systems often fail to generalize across diverse populations and imaging modalities, limiting their clinical scalability [2]. Moreover, many artificial intelligence (AI)-based approaches suffer from data imbalance, limited robustness, and poor interpretability, which restricts physician trust and real-world deployment [3]. These challenges motivate the development of intelligent, explainable, and generalizable frameworks capable of delivering accurate cardiovascular disease detection across heterogeneous cardiac imaging datasets.

Over the last several years, various deep learning-based models of segmentation have been offered in order to solve the problem of cardiac structure demarcation. Other architectures like the Mask R-CNN architecture [4] and the Segment Anything Model (SAM) [5] have been shown to perform well in the segmentation of the heart structure of the medical images. Generative Adversarial Networks (GAN) [6] have instead been investigated more in the context of cardiovascular signal modeling and data augmentation, e.g., cardiac analysis and synthesis of ECG signals, and not anatomical cardiac segmentation. In turn, image-specific segmentation models are still more appropriate to be used in the accurate extraction of the cardiac structure. As an illustration, SAM will be able to do general segmentation but not do better in certain medical scenarios like cardiac MRI and echocardiography where grained boundaries and organ/specific prompts are necessary. Second, such models are typically weak to low-contrast, or pathological, variations, and thus will give false identification of myocardial and ventricular regions [7]. These weaknesses imply an adaptive segmentation strategy that is more precise, powerful, and contextual in a wide and clinically significant state of imaging, which should be applied.

Besides segmentation improvements, Transformer-based models [8] have also been extensively investigated in cardiovascular disease detection and classification because they can capture long-range spatial interactions by use of global self-attention. Although useful in modeling complex anatomical relationships in medical images, these architectures are usually memory-intensive in terms of the combination of computational resources and large-scale annotated data to be performance-optimal. Instead, Federated Learning (FL) [9] is a paradigm of decentralized training as opposed to a model architecture, enabling multiple clinical centers to collaboratively train models without the exchange of raw patient data and hence privacy. Although FL has benefits, it has issues associated with the heterogeneity of data, communication latency, and convergence when the data is not identically distributed [10]. Since these two approaches target the different components of the medical AI systems model architecture and training strategy, their distinct weaknesses may impact the diagnostic consistency in the different clinical settings. This encourages the creation of hybrid and explicable models that concentrate on the structural design of buildings but offer flexibility and openness to clinical implementation.

To address these concerns, this paper presents a hybrid Mamba-Transformer framework with the addition of SAM-Med2D segmentation and Grad-CAM-SHAP explainability to ensure robust and interpretable cardiovascular disease detection. The proposed architecture first uses SAM-Med2D to achieve automated and prompt-based cardiac region segmentation to provide precise and accurate extraction of cardiac structures of interest, including ventricles and myocardium. A hybrid detection module is then used to fuse Mamba state-space layers, which are efficient at modeling temporal dependencies, with Transformer attention blocks that extract spatial and contextual features to be used in creating multi-scale feature fusions that increase classification accuracy. Visualization of the most salient cardiac areas to the prediction made by Grad-CAM [11], and the significance of each input feature to the decision made by SHAP are further used to enhance the predictions in terms of transparency and interpretability. Besides ensuring an improvement in clinical confidence, this general and explainable hybrid model is an efficient procedure for improving the diagnostic precision of cardiovascular disease detection through interpretability, reproducibility, and data efficiency.

The main contributions of the paper include:

• Hybrid Spatio-Temporal Architecture: This paper proposes a novel Hybrid Mamba-Transformer framework that integrates Transformer-based spatial feature extraction with Mamba-based state-space temporal modeling to effectively capture both structural and sequential dependencies in cardiovascular medical imaging data.

• Segmentation-Guided Feature Learning: The proposed framework incorporates the SAM-Med2D to provide segmentation-aware feature representations, enabling improved anatomical localization and boundary-aware learning in cardiovascular image analysis.

• Explainable AI-Driven Clinical Decision Support: To enhance transparency and physician trust, the proposed framework incorporates Grad-CAM and SHAP-based explanation mechanisms, providing both visual localization and feature-level interpretability of clinically relevant cardiac regions.

• Cross-Dataset Validation with Strong Performance: The proposed approach is validated on three benchmark cardiovascular datasets (EchoNet-Dynamic, CAMUS, and UK Biobank CMR), achieving segmentation performance above 91% Dice while demonstrating consistent generalization across heterogeneous imaging modalities and providing approximately 3%–6% improvement in segmentation performance and 3%–4% improvement in classification accuracy compared with recent state-of-the-art approaches.

The rest of the paper is organized as follows: Section 2 reviews the related work in cardiovascular disease detection using deep learning. Section 3 describes the proposed Hybrid Mamba-Transformer framework, including the model architecture and the integration of SAM-Med2D, Mamba temporal modeling, and Transformer components. Section 4 presents the experimental setup, datasets, and evaluation metrics. Finally, Section 5 concludes the paper and discusses potential directions for future research.

2 Literature Review

Left ventricular ejection fraction (LVEF) is an important index of cardiac function that relies on correct segmentation of the left ventricle (LV). Current methods often fail to work well with small data sets and fail to generalize well. To solve these challenges, Wu et al. [12] have proposed LV-SAM, which is based on SAM-Med2D, with a multi-scale adapter, multimodal prompt encoder, and multi-scale decoder for accurate LV segmentation. The performance is further improved by an end-to-end automated prompt generation pipeline, where experiments using the CAMUS dataset give superior accuracy and an absolute correlation coefficient and minimum MAE of 5.016 for LVEF estimation. Gurusubramani and Latha [13] have introduced a hybrid GAN with semantic resonance for cardiac images synthesis with realistic and clinically relevant. The model is made of local and global generators innervated by pre-trained CNN classifiers for semantic accuracy. Based on the adversarial loss and classification loss, it obtains 98.96% accuracy. The proposed method achieves SSIM and PSNR values of 0.955 and 45.23, which are better than those of other methods. Naseer et al. [14] have focused on enhancing the prediction of cardiovascular disease (CVD) using multi-algorithm machine learning techniques. On various datasets such as Cleveland, Hungarian, Switzerland, Statlog, VA Long Beach, and a large 70k dataset of CVD, the proposed Hybrid Linear Regression Bagging Model (HLRBM) shows better performance. The model combines logical regression with bagging and applies pre-processing techniques such as the standard scaling and SMOTE methods for balanced learning. Experimental results demonstrate that HLRBM can be more accurate and reliable than other traditional models, including SVM, KNN, NB, RF, and LR, to assess the risk of CVD.

Cardiovascular disease (CVD) prediction can face major improvement with the help of sophisticated deep learning-based intelligent systems. By the system, a hybrid structure is proposed by Mandava [15], which introduces the powerful image feature extraction capability with Modified DenseNet201 (MDenseNet201) and the accurate image classification capability with Improved Deep Residual Shrinkage Network (IDRSNet). Using five benchmark UCI cardiac datasets, various pre-processing techniques, such as outlier detection, missing values, and data balancing, were employed to enhance the quality of the data. The accuracy of the proposed MDenseNet201-IDRSNet model is 99.12%, which not only outperforms the traditional method but also lays the foundation to provide an early diagnosis of CVD. Echocardiographic segmentation is an essential tool in the diagnosis of cardiac disease, but it is affected by noise, poor resolution, and complicated anatomy. A Large-Window Mamba Scale (LMS) module and hierarchical feature fusion-based U-shaped deep learning model is proposed by Yang et al. [16], to achieve accurate segmentation. The LMS module models the long-range dependencies, and the cascaded residual blocks perform multiscale feature extraction. Experiments on EchoNet-Dynamic and CAMUS data sets establish a new state of the art in accuracy and robustness over the current methods. Early diagnosis of cardiovascular disease (CVD) is important for the reduction of mortality and to improve outcomes. Sumon et al. [17] have presented CardioTabNet, a transformer-based model that used tab transformer architecture to extract and rank key clinical features using a random forest algorithm. The extra tree classifier achieved an accuracy of 94.1% and an accuracy under the receiver operating characteristic curve of 95% with the classical approach. SHAP and nomogram analyses to aid in interpretation: Validating proposed framework as a robust clinical decision support system.

Phonocardiogram (PCG) signals provide a non-invasive method for the diagnosis of coronary heart disease (CHD), the world’s top global killer. A hybrid Convolution-Transformer Neural Network (HCTNN) is proposed by Zhao et al. [18], which combines local feature extraction by CNNs and global representation by a pruned Vision Transformer (ViT). Pre-processed PCG signals are then converted to CWT spectrograms, and a reweighting fusion mechanism is used to integrate global features and local features for classification. The model has an accuracy of 94.24% compared to the others with ViT and advanced CNNs because the model is effective in CHD detection. A machine learning model was developed by Qi et al. [19], using data from diet data on antioxidants to predict cardiovascular disease (CVD) and cancer comorbidities. Based on NHANES, there were 29 antioxidant and 9 baseline features analyzed for analysis after preprocessing. Among several algorithms, we found LightGBM to be the best with an 87.9% accuracy and 0.951 AUC. SHAP analysis revealed important predictors, such as naringenin, magnesium, theaflavin, kaempferol and vitamin C. Sathi et al. [20] have introduced an interpretable ECG-based diagnostic model for the detection of ischemia and arrhythmias-key causes of CVD. Three benchmark ECG datasets, MIT-BIH Arrhythmia, European ST-T, and Fantasia, were combined to train various machine learning models. The histogram gradient boosting classifier had an accuracy of 90% and an Area Under the Curve of 0.99, 0.99, and 0.89 in healthy, ischaemic, and arrhythmic cases, respectively. An explainable AI analysis showed that ECG fiducial points (RR interval, QRS duration, QT interval, and ST segment) were found to be key points of diagnosis. Table 1 provides a concise comparison of representative cardiovascular imaging methods, highlighting their core techniques, evaluation datasets, and remaining limitations.

Some more recent works have investigated the ideas of cardiovascular image segmentation and disease diagnosis using deep learning architectures. As an example, Lin et al. [21] have proposed a cardiac structure segmentation network (deep spatial attention, DSA) that showed better spatial features recognition compared to worse time modeling. On the same note, Nazari et al. [22] used a WGAN architecture to analyse medical images, whereas Deng and Wu [23] used NCM-Net to classify cardiac images with convolutional embeddings. The methods are the major benchmarks that will be employed in drawing comparisons in the experimental assessment of the proposed Hybrid Mamba-Transformer framework.

Recent publications further highlight the rapid evolution of cardiovascular imaging research toward foundation-model adaptation and efficient spatio-temporal learning. Approaches such as SAM-Med2D, hybrid GANs, and hybrid ML systems have been used to enhance the analysis of cardiac structure and predict accurate results [24]. Transformer-based and Mamba-driven models for improved echocardiographic segmentation and clinical feature interpretation [25]. Hybrid CNN-Transformer models and antioxidant-based ML models improved the coronary and comorbidity detection, and ECG-based models improved the arrhythmia and ischemia detection [26]. The existing approaches suffer from limitations such as low generalization, high computational expense, and low interpretability. To fill up these gaps, a Hybrid Mamba-Transformer framework using SAM-Med2D segmentation and Grad-CAM-SHAP explainability is proposed. It represents a combination of multi-scale spatial-temporal learning, adaptive optimization, and explainable AI. This leads to better diagnostic precision, strength, and clinical reliability.

Recent studies have made progress in the CVD diagnosis using deep and machine learning models for segmentation, prediction, and classification. Approaches such as SAM-Med2D, hybrid GANs, and hybrid ML systems have been used to enhance the analysis of cardiac structure and predict accurate results. Transformer-based and Mamba-driven models for improved echocardiographic segmentation and clinical feature interpretation. Hybrid CNN-Transformer models and antioxidant-based ML models improved the coronary and comorbidity detection, and ECG-based models improved the arrhythmia and ischemia detection. The existing approaches suffer from limitations such as low generalization, high computational expense, and low interpretability. To fill up these gaps, a Hybrid Mamba-Transformer framework using SAM-Med2D segmentation and Grad-CAM-SHAP explainability is proposed. It represents a combination of multi-scale spatial-temporal learning, adaptive optimization, and explainable AI. This leads to better diagnostic precision, strength, and clinical reliability.

3 Proposed Methodology

In this section, we present the Hybrid Mamba-Transformer framework for cardiovascular disease detection, which integrates advanced techniques for both segmentation and classification. The proposed methodology combines SAM-Med2D for accurate segmentation, Mamba temporal modeling for efficient temporal feature extraction, and Transformer-based spatial learning to enhance diagnostic performance.

3.1 Data Acquisition and Preprocessing

Three publicly available cardiovascular imaging datasets (EchoNet-Dynamic [27], CAMUS [28]) and the UK Biobank Cardiac MRI (CMR) data set [29]) were utilized in the study to conduct a comprehensive evaluation and generalization. These data sets cover different imaging modalities, such as two-dimensional echocardiographic cine sequences and three-dimensional and time (3D + time) cardiac MRI, allowing the proposed hybrid Mamba-Transformer framework to be evaluated in different and heterogeneous acquisition conditions. Table 2 contains the major statistics of the datasets. EchoNet-Dynamic is a collection of 10,036 2D echocardiographic cine videos of 10,036 patients with ejection fraction and end-systolic and end-diastolic volume annotations and has a balanced gender representation. The CAMUS dataset consists of 2D echocardiography sequences, which were recorded in 500 patients in apical two-chamber and four-chamber views, and expertly annotated left ventricle, myocardium, and left atrium throughout cardiac phases. By contrast, the UK Biobank CMR dataset has about 5000 subjects where cine MRI acquisitions take the form of short-axis volumetric stacks sampled at various cardiac phases and offers a high-quality 3D + time representation of cardiac structure and motion in addition to clinical metadata. These datasets jointly cover complementary spatial, temporal, and contrast features, which help to evaluate the results multimodally, generalize the findings across domains, and allow the reproducibility of the experimental outcomes.

images

All the image sequences were evenly preprocessed before training the model. Each frame was resized to a fixed spatial resolution of 256 × 256 pixels, and intensity normalization was applied to have its values between the interval [0, 1]. To enhance the robustness and variability of the training data, standard data augmentation techniques of random rotations with a maximum of 15∘ rotation angle, horizontal and vertical flipping, contrast adjustment, and perturbation with Gaussian noise were applied [30]. Segmentation for CAMUS and EchoNet-Dynamic is standardized to binary or multi-class label maps, whereas the volume images of interest from UK Biobank were motion-corrected and synchronised in time to ensure a consistent cardiac phase sampling [31]. The data sets were divided into 70% training, 15% validation, and 15% testing sets, ensuring that the identity of patients did not overlap across data splits. These specially preprocessed data were used as a basis in SAM-Med2D segmentation, hybrid Mamba-Transformer feature extraction, and classification. The sample images from EchoNet-Dynamic, CAMUS, and UK Biobank CMR datasets before and after preprocessing are shown in Fig. 1.

images

Figure 1: Sample images from EchoNet-dynamic, CAMUS, and UK Biobank CMR datasets before and after preprocessing, showing normalization and resizing applied prior to analysis.

3.2 SAM-Med2D Segmentation

The proposed framework consists of a tailored SAM-Med2D segmentation module, which offers anatomically-directed cardiac area-delimiting over heterogeneous cardiovascular imaging modalities. This architecture, as seen in Fig. 2 combines image embeddings, prompt embeddings, and positional encodings via a two-way medical cross-attention network. The image encoder is used to construct 256 × 256 feature embeddings on the cardiac images, and the prompt encoder constructs sparse prompt embeddings and dense mask embeddings of anatomical guidance cues. These embeddings are reshaped to 4096 × 256 feature representations so that they can efficiently interact with spatial features and prompt information. It is then followed by the two-way cross-attention blocks, whereby the information can be exchanged between the image and prompt embeddings in the direction of query, key, and value projections, and refine the feature with the help of normalization and multi-layered perceptron (MLP) layers.

images

Figure 2: Architecture of the proposed SAM-Med2D segmentation module, illustrating the integration of image and prompt embeddings through two-way cross-attention and adaptive upscaling to generate high-resolution cardiac masks.

The mask decoder then makes use of mask tokens (4 × 256) and an IoU token (1 × 256) to generate predicted segmentation masks using a mask prediction head to produce low-resolution segmentation maps (4 × 256 × 256), which are refined by an adaptive upscaling module to produce high-resolution cardiac segmentation output. This structure allows the model to dynamically target anatomically important structures, including, but not limited to, the left ventricle (LV), left atrium (LA), myocardium, and background regions. The proposed SAM-Med2D module, through the explicit construction of the interplay between image characteristics and prompt-driven anatomical channels, can improve the strength of the segmentation and the localization of boundaries. The resulting segmentation masks have lower anatomically refined regions of interest (ROI), which are then input into the Hybrid MambaTransformer framework to learn downstream spatio-temporal features.

The segmentation process begins by jointly encoding spatial features and prompt information to create a semantically rich cardiac representation. The image encoder extracts low-level visual patterns, while the prompt encoder introduces anatomical priors for guiding segmentation.

Fe(x,y)=σ(∑c=1cWe(c)∗I(c)(x,y)+be)(1)

Pp=tan⁡h(Wt.Emb(Tp)+γPosEnc(Ωp)+bt)(2)

Ef=λ1Fe+λ2Pp+λ3(Fe⊙Pp)(3)

In these expressions, I(c) represents the cardiac input image with C channels, while We(c) and be denote convolutional weights and bias. Eq. (1) generates the encoded feature map Fe through convolution and activation σ(⋅). Eq. (2) constructs a prompt embedding Pp from the textual or coordinate token Tp, scaled by the positional encoding PosEnc(Ωp) and parameters Wt,bt,γ. The fusion in Eq. (3) employs adaptive coefficients λ1,λ2,λ3 to merge semantic and structural priors, producing Ef the fused feature tensor fed into the attention encoder.

To integrate local anatomical cues with global spatial dependencies, SAM-Med2D employs a multi-head attention mechanism enhanced with topological regularization. This component captures complex inter-pixel relations in cardiac imagery.

Qh=Wq(h)Ef,Kh=Wk(h)Ef,Vh=Wv(h)Ef(4)

Ah=Softmax(QhKh⊤dh+βAdj(S))(5)

Z=Concath=1H(AhVh)Wo+ηEf(6)

here, Qh,Kh,Vh correspond to query, key, and value projections for each attention head h∈[1,H], parameterized by weights Wq(h),Wk(h),Wv(h). The adjacency term Adj(S) in Eq. (5) regularizes spatial similarity using anatomical structure S, modulated by β. Eq. (6) concatenates all heads and introduces residual fusion with ηEf, yielding the refined feature tensor Z, which captures both short-range pixel coherence and long-range cardiac morphology continuity.

Following contextual refinement, SAM-Med2D predicts segmentation masks through hierarchical convolution and region-specific normalization, ensuring pixel precision and boundary stability.

M^(x,y)=σ(∑k=1KWm(k)∗Z(k)(x,y)+bm)(7)

Mroi(x,y)=M^(x,y).1Ωcardiac(x,y)∑(x,y)∈ΩcardiacM^(x,y)(8)

Mnorm=Mroi−μ(Mroi)σ(Mroi)+ε(9)

In these equations, M^(x,y) denotes the raw mask output computed via convolution (∗) in Eq. (7), and Mroi isolates the anatomical region of interest using indicator function 1Ωcardiac. Eq. (9) normalizes the region mask to zero mean and unit variance using mean μ(⋅) and standard deviation σ(⋅), stabilized by a small constant ε. This normalization ensures uniform scaling across patient data before loss evaluation and fusion with the classifier module.

To optimize segmentation accuracy, SAM-Med2D combines boundary-sensitive and overlap-based losses in a composite objective that adapts dynamically during training.

ℒDice=1−2∑x,yM^(x,y)Mgt(x,y)+δ∑x,yM^2(x,y)+∑x,yMgt2(x,y)+δ(10)

ℒIoU=1−∑x,yM^(x,y)Mgt(x,y)∑x,y[M^(x,y)+Mgt(x,y)−M^(x,y)Mgt(x,y)](11)

ℒtotal=ω1ℒDice+ω2ℒIoU+ω3‖∇M^−∇M^gt‖22(12)

Eq. (10) defines the Dice loss LDice that balances region overlap and shape preservation using the smoothing term δ. Eq. (11) expresses the IoU loss, penalizing misalignment between the prediction M^ and ground truth Mgt. The total loss in Eq. (12) adds a gradient-consistency term weighted by ω3, encouraging smooth boundary transitions. Together, these losses ensure fine-grained contour accuracy while maintaining global anatomical coherence.

Finally, SAM-Med2D employs adaptive optimization and fusion of multi-scale outputs to stabilize convergence and improve real-time segmentation performance.

θt+1=θt−ηt∇θtℒtotalvt+ε(13)

Oseg=∑s=1Sπs.Upsample(Mnorm(s))(14)

Sfinal(x,y)=argmaxc(Oseg(x,y,c))(15)

In this final stage, Eq. (13) models the parameter update rule, integrating adaptive learning rate ηt, moment estimation vt, and weight decay λ to emulate Adam W-like optimization behavior. The multi-scale fusion in Eq. (14) combines normalized masks Mnorm(s) across scales s with fusion coefficients πs, while Eq. (15) yields the discrete segmentation map Sfinal via channel-wise maximum activation. Together, these operations produce anatomically consistent and computationally efficient segmentation results, establishing SAM-Med2D as a powerful foundation for subsequent feature extraction and disease classification within the hybrid framework.

3.3 Hybrid Mamba-Transformer Feature Extraction

To overcome the limitation in understanding spatial context and modeling temporal dynamics in healthcare medical imaging, a novel Hybrid Mamba-Transformer Feature Extraction module is proposed in this research paper to achieve a synergistic integration of the state-space efficiency in Mamba layers and the global attention capability in Transformer blocks [17]. The Mamba component captures vertebrate myocardial sequential dependency between cardiac frames using the continuous time state transition and parameterized recurrence relations to give a robust formation of temporal representation of the myocardial motion and ventricular deformation patterns. Instead, in parallel, the Transformer subnetwork captures long-range spatial dependencies with the help of multi-head self-attention (MHSA), being able to contextualize the structural relationships between different regions in the heart [32]. To achieve unified learning, the two feature domains are fused by a cross-attention gating mechanism, where the state space outputs are taken as temporal keys, and the Transformer embeddings are taken as spatial queries, which ensures the bidirectional information exchange between motion and morphology representations. Architecture of the proposed Hybrid Mamba-Transformer Feature Extraction module showing integration of the Transformer-based spatial encoding and Mamba-based temporal state-space modeling for spatio-temporal cardiac feature representation can be seen in Fig. 3. The Transformer module in Fig. 2 was configured using a lightweight encoder design to balance performance and computational feasibility in clinical settings. Specifically, the embedding dimension and number of attention heads were selected following common practice in medical vision transformers, while the encoder depth was kept moderate to avoid overfitting on limited cardiac datasets. These parameters were empirically tuned on the validation set to achieve optimal accuracy efficiency trade-offs, ensuring robust spatial representation learning without excessive computational overhead.

images

Figure 3: Architecture of the proposed Hybrid Mamba-Transformer Feature Extraction module, illustrating integration of Transformer-based spatial encoding and Mamba-based temporal state-space modeling for spatio-temporal cardiac feature representation.

The temporal evolution of cardiac features is first modeled using a discretized state-space system that captures fine-grained dynamics across sequential frames.

ht+1=(I+ΔtAt+12(Δt)2At2)ht+ΔtBtxt+ΔtΣtεt(16)

yt=Ctht+Dtxt+Γht⊙xt+vt(17)

Ht=∑t=1T(Πk=tTΦk)Btxt+∑t=1T(Πk=tTΦk)Σtεt(18)

In these relations, ht represents the hidden temporal state, xt denotes the feature vector derived from segmented frames at time index t, and At,Bt,Ct,Dt,Γt encode time-varying Mamba parameters. Eq. (16) incorporates a second-order discretization of the matrix exponential, along with diffusion governed by covariance Σt and noise εt. Eq. (17) combines linear and bilinear interactions between ht and xt, plus a residual term vt, producing the instantaneous temporal response yt. Eq. (18) aggregates temporal influence across the entire sequence, with Φk=I+ΔtAk and Πk=tTΦk acting as a transition product, resulting in global temporal representation HT that embodies long-range motion patterns.

Spatial relations between myocardial regions are encoded through a Transformer-style attention mechanism defined in a high-dimensional tensor space.

Q=WQFS,K=WKFS,V=WVFS,F~s=FSEpos+Eview(19)

Aij=exp(QiKj⊤dkrij+u⊤Ri+v⊤Rj)∑j′=1Nexp(QiKj′⊤dkrij′+u⊤Ri+v⊤Rj′)(20)

Fsp=LN(F~s+∑h=1H(A(h)V(h))WO(h))(21)

In this formulation, Fs denotes spatial feature maps derived from SAM-Med2D outputs, while Epos and Eview inject positional and view-dependent information, respectively, as shown in Eq. (19). Eq. (20) defines attention coefficients Aij incorporating scaled dot-products, relative position bias rij, and directional encodings through vectors u,vand relational descriptors Ri,Rj. Eq. (21) combines multi-head attention outputs across heads h∈1,…,H, followed by layer normalization LN(⋅) to produce a spatially contextualized representation Fsp that reflects anatomical structure and view geometry.

Temporal and spatial representations are then fused through a bilinear and gated mechanism to form unified hybrid features.

Ftm=σ(Wtm[Ht⊕GAP(Fsp)]+btm)(22)

G=σ(Wg∗(Ht⊗1hw+Reshap(Fsp))+bg)(23)

Fhyb=φ(G⊙ℬ(Fsp,Ftm)+(1−G)⊙Fsp)(24)

In Eq. (22), global temporal descriptor HT is concatenated with global-average-pooled spatial features GAP(Fsp), followed by affine transformation through Wtm, btm and non-linearity σ(⋅), yielding compact temporal-semantic code Ftm. Eq. (23) constructs a gating tensor G by convolving the broadcasted temporal state HT⊗1hw with reshaped spatial features, filtered using kernel Wg and bias bg. Eq. (24) applies a bilinear operator ℬ(Fsp,Ftm) for feature interaction, then fuses it with spatial features through gate G and activation ϕ(⋅), leading to the final hybrid representation Fhyb.

To stabilize training and enhance discriminative structure in the hybrid features, an energy-inspired and spectrum-aware formulation is introduced.

εspec=∑k=1K∑i=1dk(λk,i−λ¯)2+τ∑k=1K‖Uk⊤Uk−I‖F2(25)

εhyb=12∑t=1T‖Fhyb(t)−F^hyb(t)‖22+γ∑t=2T‖Fhyb(t)−F^hyb(t)‖22(26)

F~=(Fhyb−∇Fhyb(ωspecεspec+ωhybεhyb))(27)

Eigenvalues λk,i and eigenvectors Uk of selected covariance operators derived from attention heads or temporal dynamics contribute to the spectral penalty Espec in Eq. (25), with λ¯ denoting their mean and τ controlling orthogonality regularization. Eq. (26) defines hybrid energy Ehyb, combining reconstruction discrepancy between Fhyb(t) and auxiliary estimate F^hyb(t) with temporal smoothness across consecutive time indices. Eq. (27) refines the hybrid representation through an energy-based update using gradients of a weighted combination of Espec and Ehyb, followed by layer normalization, producing a spectrally regularized tensor F~.

The final part of the hybrid feature extractor focuses on class-level encoding and optimization driven by a margin-enhanced objective.

z=Wp.(∑i=1NωiF~i)+bp,ωi=exp⁡(u⊤F~i+δ‖F~i‖22)∑j=1Nexp⁡(u⊤F~j+δ‖F~j‖22)(28)

pc=exp⁡(s(cos⁡(θc−mc)))∑j=1Cexp⁡(s(cos⁡(θj−mj)))(29)

ℒhyb=−∑c=1Cyclog⁡(pc)+λspecεspec+λhybεhyb+λw‖Wp‖22(30)

θt+1=θt−ηtm^tv^t+ε+λdecayθt(31)

Weighted hybrid attention pooling in Eq. (28) computes descriptor z using coefficients ωi that depend on both linear similarity and quadratic norm of local hybrid features F~i. Eq. (29) expresses class probabilities pc under an angular margin-based Softmax model with scale s and class-specific margins mc, where θc denotes the angle between the feature vector and the class weight vector. Eq. (30) defines the hybrid loss Lhyb, combining cross-entropy with spectral regularization, hybrid energy penalties, and weight decay on the projection matrix Wp. Eq. (31) updates parameters θt using bias-corrected moment estimates m^t,v^t in an Adam W-style rule with decay factor λdecay, ensuring stable convergence of the Hybrid Mamba-Transformer feature extractor.

3.4 Classification and Adaptive Optimization

The classification stage of the proposed framework integrates a multi-layer dense network that transforms the fused spatio-temporal embeddings from the Hybrid Mamba-Transformer module into final diagnostic predictions. The extracted feature maps are flattened and passed through two fully connected layers with ReLU activation, followed by a Softmax output layer that computes the probability distribution across cardiovascular disease categories. To achieve stable convergence and efficient gradient propagation, the model uses the AdamW optimizer that combines the adaptive moment estimation method of Adam and takes into account the decoupling of weight decay, which allows the control of overfitting in high-dimensional data in the field of medicine. Furthermore, the training process is also guided by a Focal Loss function, which is designed to overcome the class imbalance issue by dynamically scaling the loss for those samples that are difficult to classify while reducing the impact of easy-to-classify instances. The said adaptive weighting strategy allows the network to pay more attention to the minority pathological cases to enhance the sensitivity and specificity of diagnostic prediction. Through joint application of AdamW optimization and Focal Loss adaptation, the model is faster to converge, better to generalize, and has improved diagnostic robustness, transmitting an accurate probability score of a cardiovascular disease reflecting both a structural and a functional cardiac problem.

3.5 Explainable AI Integration

To guarantee the clinical interpretability and increase the transparency of the diagnosis, the proposed framework embeds a dual-level explainable AI (XAI) module consisting of Gradient-weighted Class Activation Mapping (Grad-CAM) and SHapley Additive exPlanations (SHAP). These complementary approaches allow spatial and feature-level interpretability to give insight into how and why the model arrives at a specific diagnostic decision for clinicians.

3.5.1 Grad-CAM Visualization

The Grad-CAM module produces saliency heatmaps by determining the gradient of the target class score with respect to the convolutional feature maps in the last hybrid encoder layer. This visualisation tends to focus on the most influential spatial regions within the cardiac images, typically the left ventricular walls, the septal regions, or the myocardial boundaries responsible for the disease classification outcome. By overlaying these heatmaps on top of the original images, the system provides clinically-interpretable visual cues in line with pathological features such as thickening of walls, regional motion abnormality, or dilation of chambers of the heart. The Grad-CAM outputs, therefore, play a role as a type of visual validation layer in order to bridge the divide between the functioning of deep model inference and the clinical understanding of it.

3.5.2 SHAP Feature Interpretation

In parallel, SHAP analysis offers another type of feature attribution, that of quantitative attribution, by calculating Shapley values that are designed to quantify the contribution of each feature to the final prediction that the model makes. This approach distributes the output probability amongst features, which typically provides an understandable explanation for the output (on the input and decision level). In the context of cardiovascular disease detection, SHAP finds out which of the extracted spatial-temporal features, such as ventricular motion indexes, shape deformations, or texture variations, have the greatest impact on the diagnoses. When paired with Grad-CAM, SHAP advances the interpretability spectrum from visual spatial information to numerical feature importance to offer a multipierce explainability layer to support clinician trust, need for transparency in diagnosis, and regulatory compliance in AI-assisted cardiology.

The Hybrid Mamba-Transformer Algorithm 1 is based on a fusion of segmentation, spatio-temporal feature extraction, and classification in a unified pipeline in cardiovascular disease detection. SAM-Med2D first segments the cardiac structures, and then Mamba and Transformer modules take the temporal features and spatial features that are fused using their temporal feature with cross-attention and a final prediction. The model is trained iteratively and evaluated by using a standard metric, Graduate-CAM, and SHAP gives the ability to interpret the learned representations.

images

4 Results and Discussion

In this part, we provide the experimental results of the Hybrid Mamba-Transformer model, which includes the performance on different datasets of both segmentation and classification. The discussion calls out some key insights, compares our approach with existing models, and examines the strengths and limitations of the model in real-world clinical scenarios.

4.1 Experimental Setup

The experiments were performed on a high-performance computing environment built on an Nvidia A100 GPU, 32 GB Ram and running on Ubuntu 20.04. The implementation of the deep learning models in the project was done using a PyTorch framework (version 1.10) and CUDA 11.2 for GPU acceleration. The data sets used for this study, which focused specifically on EchoNet-Dynamic, CAMUS, and UK Biobank CMR were divided into training, validation, and testing data sets in ratios of 70%, 15%, and 15%, respectively. For training purposes, the batch size was defined as a number of samples in each batch equal to 16, the learning rate was initialized as 0.0001, and the number of epochs was defined to be 50. The AdamW optimizer with weight decay to prevent overfitting and Focal Loss for class imbalance were used. Evaluation metrics used for segmentation were Dice Similarity Coefficient (Dice), Intersection over Union (IoU), and for classification Accuracy, Precision, Recall, F1-score, and AUC were used to see the performance of the model.

4.2 Segmentation Performance

SAM-Med2D was tested on three cardiovascular imaging datasets, which were EchoNet-Dynamic, CAMUS, and UK Biobank CMR. The baseline segmentation models, such as U-Net, SAM, and SAM-MyoNet, were tested at the EchoNet-Dynamic dataset, and it is possible to compare them directly as data dimensionality and annotation procedures are both comparable. As illustrated in the findings, SAM-Med2D performs better than the baseline models on EchoNet-Dynamic on all the major segmentation metrics, such as Dice, IoU, sensitivity, specificity, and HD95. Besides, SAM-Med2D demonstrates good and reproducible segmentation on CAMUS and UK Biobank CMR, which suggests that it is very stable with a wide variety of imaging modalities. Table 3 shows the segmentation results of SAM-Med2D on all three datasets and its relative analysis with the basic models on EchoNet-Dynamic with critical measures of Dice, IoU, sensitivity, and specificity. In Eq. (32), the performance gain (PG) represents the percentage improvement achieved by the proposed model compared with the baseline method. Here, Pproposed denotes the performance metric obtained by the proposed framework, while Pbaseline represents the corresponding metric achieved by the baseline model.

PG=Pproposed−PbaselinePbaseline×100(32)

The performance of the SAM-Med2D model was tested on three datasets, EchoNet-Dynamic, CAMUS, and UK Biobank CMR, and compared with baseline models (U-Net, SAM, and SAM-MyoNet) on the EchoNet-Dynamic dataset. SAM-Med2D shows consistent and improved performance against the baselines in all the key metrics with the highest Dice scores of 91.20% on EchoNet-Dynamic, 89.30% on CAMUS, and 91.50% on UK Biobank CMR. It also performed better than the U-Net (with 7.30% in dice score), the SAM (with 3.10%) and the SAM-MyoNet (with 2.30%) on the EchoNet-Dynamic data set and proved to have better boundary preservation and segmentation accuracy. Additionally, when compared to literature models like MDenseNet201-IDRSNet and CardioTabNet, SAM-Med2D maintains improvements over 3% with measure dice and IoU that continue to establish its effectiveness. These results identify the robustness and superior performance of SAM-Med2D in various types of datasets, a powerful model for cardiovascular image segmentation, especially for clinical applications.

The SAM-Med2D segmentation module was able to provide anatomically coherent and high-contrast segmentation masks for raw and preprocessed cardiovascular video frames. As shown in Fig. 4, the model is able to accurately delineate important heart structures despite changes in contrast, noise, and acquisition angles. Further improving the boundary consistency and minimizing the small segmentation artifacts, maximum-likelihood reconstruction brings good spatial understanding of the sequential frame. These qualitative results support the high Dice and IoU scores reported and underscore the reliability of the model when it comes to deal with heterogeneous echocardiographic and MRI data.

images

Figure 4: Qualitative segmentation results illustrating representative frames from echocardiographic cine sequences (EchoNet-Dynamic and CAMUS). The figure shows original input frames, corresponding SAM-Med2D segmentation outputs, and reconstructed masks, demonstrating accurate delineation of cardiac structures across different image qualities, acquisition views, and imaging modalities used in this study.

4.3 Quantitative Evaluation of Classification Performance

The Hybrid Mamba-Transformer model is superior compared to several standard models, such as CNN, Vision Transformer, Msv-Mamba, DenseNet, ResNet, and Xception on key classification measures such as Accuracy, Precision, Recall, F1-score, AUC, Specificity, Sensitivity, and MCC. Achieving an AUC of 95.50% and an MCC of 0.84 displays better performance in the diagnosis, especially in differentiating between classes. The model shows the best results with respect to Specificity (94.00%) and also Sensitivity (93.10%) so that it covers the presence or absence of cardiovascular diseases. Its Log Loss is the lowest (0.24), which seems to be well-calibrated probability predictions. Table 4 shows us the classification performance comparison between the Hybrid Mamba-Transformer model and baseline models for multiple metrics, such as Accuracy, Precision, Recall, F1-score, and AUC. When compared to the baseline models, it can be seen that the Hybrid Mamba-Transformer shows great improvements, especially in both AUC and MCC, which shows that it can deal with complicated data sets and can improve the recognition of classes, especially those belonging to the minority classes. These results make Hybrid Mamba-Transformer a very effective and reliable model for cardiovascular disease detection.

The proposed Hybrid Mamba-Transformer model performed better in terms of all the evaluation metrics. In Fig. 5, Classification Performance Hybrid Mamba-Transformer model. Confusion matrix calculated on combined test set (test set EchoNet-Dynamic, test set CAMUS, test set UK Biobank CMR) (a). (b) Comparison of the Hybrid model and baseline architectures in terms of the Area Under the Curve of the ROC curve. (c) Comparison of the MCC of all the models. (d) ROC curves to demonstrate the superior sensitivity vs. specificity trade-off obtained by the Hybrid Mamba-Transformer vs. the other models can be seen. The confusion matrix indicates a balanced distribution of true positives and true negatives, indicating the capability of differentiating between CVD and non-CVD cases is quite good in EchoNet-Dynamic, CAMUS, and UK Biobank CMR test sets. Compared to baseline models, the best discriminative power and robustness of the Hybrid model were assessed as high AUC (0.955) and MCC (0.84), which validated the actual clinical utility of the proposed hybrid design for prediction. The ROC curves further demonstrate consistently greater sensitivity for all false-positive rates, which illustrates how the hybrid spatio-temporal representation is effective in capturing the clinically relevant cardiac patterns. In all cases, these results validate a Hybrid Mamba-Transformer as a reliable and generalizable classifier for the detection of cardiovascular disease.

images

Figure 5: Classification performance Hybrid Mamba-transformer model. Confusion matrix calculated on combined test set (test set EchoNet-Dynamic, test set CAMUS, test set UK Biobank CMR) (a). (b) Comparison of the hybrid model and baseline architectures in terms of the area under the curve of the ROC curve. (c) Comparison of the MCC of all the models. (d) ROC curves to demonstrate the superior sensitivity vs. specificity trade-off obtained by the Hybrid Mamba-Transformer vs. the other models.

4.4 Ablation Study

In this ablation study, we assess the role of each module in the Hybrid Mamba-Transformer architecture by gradually adding modules to the fundamental network pattern. The base architecture is a Transformer-based feature-extraction backbone that provides spatial representations but lacks temporal modeling and segmentation guidance. After the Transformer block, the Mamba temporal modeling module is included, which captures sequential dependencies on the extracted features. The SAM-Med2D segmentation module is then presented to offer segmentation-sensitive spatial representations that offer a better localization of boundaries and an improvement in the precision of segmentation. Table 5 shows the ablation results of the segmentation performance, where it shows the increase in the performance results as each module is gradually connected to the Hybrid Mamba-Transformer architecture. On the same note, Table 6 also shows the ablation study of the classification performance where each of the modules contributes to the overall diagnostic ability of the model. Although the baseline performance has been mentioned above, this part is devoted to the gradual increase in performance when more modules are added to the architecture.

images

The ablation analysis clearly demonstrates that every module plays an important role in the overall performance of the Hybrid Mamba-Transformer model. Starting from the baseline (already discussed), when adding the Transformer block, it adds to the spatial feature extraction, while Mamba temporal modeling enhances the ability to learn the sequential dependency. The full Hybrid Mamba-Transformer model with all modules integrated is able to provide the highest performance in all aspects, which shows that integrating Mamba and Transformer can perform effective spatio-temporal learning. This hybrid approach provides an improvement in both learning stability and accuracy, and hence could be a robust solution for complex tasks like cardiovascular disease detection.

4.5 Visual and Explainability Analysis

The Grad-CAM and SHAP visualizations illustrate the clinically relevant regions of the human heart that the Hybrid Mamba-Transformer model uses to predict CVD. As can be seen in Fig. 6, both approaches have consistently focused on the left ventricular walls, septal areas, and myocardial boundaries-regions typically evaluated by cardiologists as the site of functional and structural abnormalities. The heatmaps show strong attention to motion-sensitive and morphology critical zones over multiple frames, showing that the model is making decisions that have an anatomical basis as opposed to spurious artifacts. This correspondence between model and clinically relevant attention aligns the clinical reliability and interpretability of this proposed framework in the real world.

images

Figure 6: Explainability results with the use of Grad-CAM and SHAP. (a) Processed illustrations of echocardiograms. (b) Grad-CAM emphasizes discriminative regions in the cardiac images. (c) SHAP saliency maps with the feature contribution. Both methods highlight anatomically relevant structures to be used by the model for CVD prediction.

4.6 Comparative Analysis with Existing Studies

This section aims to place our proposed model, the Hybrid Mamba Transformer, in comparison to the recent state-of-the-art models in CVD detection and diagnosis. Table 7 shows us the performance comparison of the Hybrid Mamba-Transformer with existing models on segmentation tasks on key metrics, including Dice, IoU, Sensitivity and Specificity and Table 8 shows us the performance comparison of the Hybrid Mamba-Transformer against existing Models on classification tasks, point out Accuracy, Precision, Recall, F1-Score and AUC. We compare the diagnostic performance of our model against five models from the recent literature, summarising key metrics, as well as pointing out our contributions across the modalities of hybridisation, interpretability and generalisation.

images

From the comparative outcome, it proves the efficiency of our Hybrid Mamba-Transformer model, which outperforms the previously used segmentation and classification model with a Dice score of 91.20% and AUC of 95.50%, showing a measured improvement of 4%–6% compared to the previous segmentation and classification methods. The temporal and spatial fact integration via Mamba and Transformer blocks boosts both the segmentation and the classification, whereas the addition of Grad-CAM and SHAP explainability helps ensure that the model is transparent. Additionally, the generalization of our model across multiple data sets and imaging modalities and the use of Focal Loss to mitigate the effect of class imbalance improve recall and sensitivity, making this deep learning model a reliable tool for clinical diagnostics. On the whole, the Hybrid Mamba-Transformer is here to set a new standard in cardiovascular image analysis by putting together segmentation, spatio-temporal learning, and interpretability in a frozen structure.

4.7 Computational Efficiency and Robustness

This part measures Hypercube Hybrid ptiles’ efficacy and sturdiness with key measures, such as time per sample for the integer, model parameters, and FLOP (Floating Point Operations Per Second). These characteristics are of significant importance to learn more about the complexity and the computational cost of the model, but also about its deployability for clinical use. In addition, we perform sensitivity tests to understand the robustness of the model under different kinds of real-world distortions (noise, image rotation, and changes in resolution). Computational efficiency comparison between Hybrid Mamba-Transformer and other state-of-the-art models in terms of inference time, model parameters, and FLOPs can be seen in Table 9. The tests deal with difficulties usually faced in medical imaging, such as noise, the rotation of the images, and varying the quality of an image. The comparison with state-of-the-art models demonstrates the trade-off between model complexity and clinical deployability, due especially to the complex components of the model in the spatial and temporal aspects, and interpretability.

images

The Hybrid Mamba-Transformer model has a slightly increased inference time (210 ms) at a sacrifice of simpler models such as CardioTabNet (150 ms) and MDenseNet201-IDRSNet (140 ms) when compared to the model. This contributes to the enhanced performance, such as spatio-temporal learning and interpretability features (essential for clinical decision-making). The model’s 85 million parameters and 150 billion FLOPS are much higher than simpler models, such as CardioTabNet (with 35 million parameters and 50 billion FLOP), needed to enable the model’s high-level functionality of segmentation and classification. These increases in computational cost make it possible for this model to undertake complex tasks with high levels of accuracy, making it ideal for clinical environments where the ability to be transparent and perform well is essential. Table 10 shows us the sensitivity test results proving the Hybrid Mamba-Transformer robustness to noise, rotation, and resolution change, revealing its capability to keep its high performance under different imaging conditions.

images

For testing the quality of the Hybrid Mamba-Transformer, we performed sensitivity testing by varying various real-world scenarios depending on noise, image rotation, and resolution. These tests simulate typical issues that medical imaging systems encounter, including noise snapping and variation in the image quality due to rotation or image quality degradation.

The Hybrid Mamba-Transformer model shows great potential under various distortions, including Gaussian noise, rotation, and resolution reduction. Despite a small reduction in performance (e.g., Dice score decreased to 89.30% with noise), the model could achieve fairly high sensitivity (88.20%) and specificity (90.10%), indicating that the model performed well with noisy and rotated images. Even with elevated distortions such as increased noise or rotation, the model’s AUC was still high, which suggests the generalization ability and robustness of the proposed model and can be well-suited for clinical applications where imaging conditions are varied. Two aspects of robustness, which are complementary in the proposed framework, are illustrated in Fig. 7. The lack of amalgamation between CVD and non-CVD prediction scores clearly demonstrates the stability of the probability outputs and high probability confidence in all the test data sets shown in the left panel. The distribution of frames per sequence for EchoNet-Dynamic, CAMUS, and UK Biobank CMR is shown in the right panel, which highlights the great temporal variability of these 3 datasets. Despite this heterogeneity, there is consistent performance from the model, even showing good robustness to sequence length, frame rate, and computational load variations.

images

Figure 7: Gradable schemes for computational robustness evaluation of the proposed framework. (a) Distribution of predicted CVD and non-CVD probability scores. Strong class separability. (b) Variation in the number of frames per sequence in the three data sets springs from distance (e.g., temporal variability, stable model performance for variation in computational burden).

5 Conclusion and Future Work

In this paper, we have proposed the Hybrid Mamba-Transformer framework for cardiovascular disease detection, which effectively integrates SAM-Med2D for accurate segmentation, Mamba temporal modeling for efficient temporal feature extraction, and Transformer spatial feature extraction for robust representation learning. Infra architecture has also integrated Grad-CAM and SHAP (Model interpretability) to provide transparent and actionable insights and tell clinicians about their model. Our model showed that it is better at both segmentation and classification compared to state-of-the-art models in terms of several important metrics such as Dice, AUC, and MCC, across various datasets, such as EchoNet-Dynamic, CAMUS, and UK Biobank CMR. The spatio-temporal learning abilities of the Mamba model enable the framework to learn both temporal and spatial dependency in the medical images, which is essential to operations like dynamic cardiac MRI analysis. The Transformer backbone helps to extract spatial features better, and this helps to improve the robustness of the model in identifying complex patterns in the given imaging data. Despite the added complexity, the performance boost of the model offers a good trade-off in terms of the computational cost, thus making it very appropriate for clinical environments where accuracy and interpretability are very important. The explainability features further enhance its clinical applicability and ensure that medical professionals can have trust and validation in the model’s predictions, which is a key requirement for adoption in the real-world clinical setting.

While the Hybrid Mamba-Transformer shows good performance, it also has certain defects, such as comparatively high computational costs and longer inference time relative to simpler models, which affect its deployment to real environments with limited available time. Future work will be spent on providing an even more optimized framework by looking into more advanced optimization possibilities to reduce these computation costs and inference time while still providing high accuracy in the diagnosis. Additionally, combining federated learning may improve data privacy and make it possible for the model to be learned across various medical centres around the world without compromising data security. Another area of potential success is the investigation of multi-modal methods that use both imaging information and information from clinical examination and genetic profile. Finally, the capability of the model to handle a broader spectrum of cardiovascular conditions and longitudinal data will allow for further improvement in its clinical usefulness in monitoring progressive disease and planning tailored treatment.

Acknowledgement: We would like to thank Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R748), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia for funding this research.

Funding Statement: This work was supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. RS-2023-00218176) and the Soonchunhyang University Research Fund. Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R748), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Ghada Atteia; methodology, Ghada Atteia, Muhammad Umer and Abdulaziz Altamimi; software, Muhammad Umer, Khaled Alnowaiser and Nihal Abuzinadah; validation, Yunyoung Nam; formal analysis, Yunyoung Nam and Yongwon Cho; investigation, Ghada Atteia; data curation, Muhammad Umer and Abdulaziz Altamimi; writing—original draft preparation, Ghada Atteia, Muhammad Umer, Nihal Abuzinadah and Abdulaziz Altamimi; writing—review and editing, Khaled Alnowaiser, Yunyoung Nam and Yongwon Cho; visualization, Nihal Abuzinadah; supervision, Yunyoung Nam and Yongwon Cho; project administration, Yunyoung Nam and Yongwon Cho; funding acquisition, Yunyoung Nam and Yongwon Cho. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The dataset can be accessed from the following link: EchoNet-Dynamic: https://echonet.github.io/dynamic/index.html#dataset; CAMUS-Human Heart Data: https://www.kaggle.com/datasets/shoybhasan/camus-human-heart-data; UK Biobank Cardiac MRI: https://community.ukbiobank.ac.uk/hc/en-gb/articles/27830032450461-Cardiac-Magnetic-Resonance-Imaging-Derived-Phenotypes-CMR-IDPs; the dataset can also be requested from the corresponding authors.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Tarek Z, Alhussan AA, Khafaga DS, El-Kenawy ESM, Elshewey AM. A snake optimization algorithm-based feature selection framework for rapid detection of cardiovascular disease in its early stages. Biomed Signal Process Control. 2025;102:107417. doi:10.1016/j.bspc.2024.107417. [Google Scholar] [CrossRef]

2. Kiran S, Reddy GR, Dorthi K. A gradient boosted decision tree with binary spotted hyena optimizer for cardiovascular disease detection and classification. Healthc Anal. 2023;3:100173. [Google Scholar]

3. Marengo A, Pagano A, Santamato V. An efficient cardiovascular disease prediction model through AI-driven IoT technology. Comput Biol Med. 2024;183:109330. doi:10.1016/j.compbiomed.2024.109330. [Google Scholar] [PubMed] [CrossRef]

4. Hoorali F, Khosravi H, Moradi B. An automatic method for microscopic diagnosis of diseases based on URCNN. Biomed Signal Process Control. 2023;80:104240. doi:10.1016/j.bspc.2022.104240. [Google Scholar] [CrossRef]

5. Ying Y, Fang X, Zhao Y, Zhao X, Zhou Y, Du G, et al. SAM-MyoNet: a fine-grained perception myocardial ultrasound segmentation network based on segment anything model with prior knowledge driven. Biomed Signal Process Control. 2025;110:108117. [Google Scholar]

6. Wang Z, Stavrakis S, Yao B. Hierarchical deep learning with Generative Adversarial Network for automatic cardiac diagnosis from ECG signals. Comput Biol Med. 2023;155:106641. doi:10.1016/j.compbiomed.2023.106641. [Google Scholar] [PubMed] [CrossRef]

7. Singh A, Nagabhooshanam N, Kumar R, Verma R, Mohanasundaram S, Manjith R, et al. Deep learning based coronary artery disease detection and segmentation using ultrasound imaging with adaptive gated SCNN models. Biomed Signal Process Control. 2025;105:107637. doi:10.1016/j.bspc.2025.107637. [Google Scholar] [CrossRef]

8. Rehman A, Naijie G, Ojo S, Nathaniel TI, Samee NA, Umer M, et al. FISM: harnessing deep learning and reinforcement learning for precision detection of microaneurysms and retinal exudates for early diabetic retinopathy diagnosis. BioData Min. 2025;18(1):75. [Google Scholar] [PubMed]

9. Manocha A, Sood SK, Bhatia M. Federated learning-inspired smart ECG classification: an explainable artificial intelligence approach. Multimed Tools Appl. 2025;84(19):21673–96. doi:10.1007/s11042-024-20084-3. [Google Scholar] [CrossRef]

10. Wang X, Hu J, Lin H, Liu W, Moon H, Piran MJ. Federated learning-empowered disease diagnosis mechanism in the internet of medical things: from the privacy-preservation perspective. IEEE Trans Ind Inform. 2022;19(7):7905–13. doi:10.1109/tii.2022.3210597. [Google Scholar] [CrossRef]

11. Raghavan K, Sivaselvan B, Kamakoti V. Attention guided grad-CAM: an improved explainable artificial intelligence model for infrared breast cancer detection. Multimed Tools Appl. 2024;83(19):57551–78. [Google Scholar]

12. Wu Y, Zhao T, Hu S, Wu Q, Chen Y, Huang X, et al. Integrating multi-scale information and diverse prompts in large model SAM-Med2D for accurate left ventricular ejection fraction estimation. Med Biol Eng Comput. 2025;63(7):2161–71. doi:10.1007/s11517-025-03310-4. [Google Scholar] [PubMed] [CrossRef]

13. Gurusubramani S, Latha B. Enhancing cardiac diagnostics through semantic-driven image synthesis: a hybrid GAN approach. Neural Comput Appl. 2024;36(14):8181–97. doi:10.1007/s00521-024-09452-0. [Google Scholar] [CrossRef]

14. Naseer A, Khan MM, Arif F, Iqbal W, Ahmad A, Ahmad I. An improved hybrid model for cardiovascular disease detection using machine learning in IoT. Expert Syst. 2025;42(1):e13520. doi:10.22541/au.169358589.99602470/v1. [Google Scholar] [CrossRef]

15. Mandava M. MDensNet201-IDRSRNet: efficient cardiovascular disease prediction system using hybrid deep learning. Biomed Signal Process Control. 2024;93:106147. [Google Scholar]

16. Yang X, Wang Q, Zhang K, Wei K, Lyu J, Chen L. Msv-mamba: a multiscale vision mamba network for echocardiography segmentation. IEEE Trans Comput Soc Syst. 2025:1–13. doi:10.1109/TCSS.2025.3562441. [Google Scholar] [CrossRef]

17. Sumon MSI, Islam MSB, Rahman MS, Hossain MSA, Khandakar A, Hasan A, et al. CardioTabNet: a novel hybrid transformer model for heart disease prediction using tabular medical data. Health Inf Sci Syst. 2025;13(1):44. [Google Scholar] [PubMed]

18. Zhao W, Ma H, Jin N, Zheng Y, Guo X. Detection of coronary heart disease based on heart sound and hybrid vision transformer. Appl Acoust. 2025;230:110420. doi:10.1016/j.apacoust.2024.110420. [Google Scholar] [CrossRef]

19. Qi X, Wang S, Fang C, Jia J, Lin L, Yuan T. Machine learning and SHAP value interpretation for predicting comorbidity of cardiovascular disease and cancer with dietary antioxidants. Redox Biol. 2025;79:103470. doi:10.1016/j.redox.2024.103470. [Google Scholar] [PubMed] [CrossRef]

20. Sathi TA, Jany R, Ela RZ, Azad A, Alyami SA, Hossain MA, et al. An interpretable electrocardiogram-based model for predicting arrhythmia and ischemia in cardiovascular disease. Results Eng. 2024;24:103381. doi:10.1016/j.rineng.2025.104070. [Google Scholar] [CrossRef]

21. Lin J, Xie W, Kang L, Wu H. Dynamic-guided spatiotemporal attention for echocardiography video segmentation. IEEE Trans Med Imaging. 2024;43(11):3843–55. doi:10.1109/tmi.2024.3403687. [Google Scholar] [PubMed] [CrossRef]

22. Nazari M, Emami H, Rabiei R, Rabiee HR, Salari A, Sadr H. Enhancing cardiac function assessment: developing and validating a domain adaptive framework for automating the segmentation of echocardiogram videos. Comput Med Imaging Graph. 2025;124:102627. [Google Scholar] [PubMed]

23. Deng X, Wu H. Echocardiography video segmentation via neighborhood correlation mining. IEEE Trans Med Imaging. 2025;44(12):5172–82. doi:10.1109/tmi.2025.3588157. [Google Scholar] [PubMed] [CrossRef]

24. Dong H, Gu H, Chen Y, Yang J, Chen Y, Mazurowski MA. Segment anything model 2: an application to 2D and 3D medical images. IEEE Trans Biomed Eng. 2026:1–17. doi:10.1109/TBME.2026.3653267. [Google Scholar] [PubMed] [CrossRef]

25. Arif S, Son SH, Kim HY, Kim SC, Lee JY. A diagnosis tool for early detection and classification of heart disease in individuals using transformer mechanisms. Comput Methods Programs Biomed. 2026;277:109248. doi:10.1016/j.cmpb.2026.109248. [Google Scholar] [PubMed] [CrossRef]

26. Mahmood AH, Hasan TM. A custom dilated-separable CNN for automated cardiovascular disease detection using electrocardiogram images. Archit Image Stud. 2026;7(1):1484–98. [Google Scholar]

27. Ouyang D, He B, Ghorbani A, Lungren MP, Ashley EA, Liang DH, et al. Echonet-dynamic: a large new cardiac motion video data resource for medical machine learning. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 8–14; Vancouver, BC, Canada. [Google Scholar]

28. Davi S, Kumar M, Hanif ZM, Kumar A, Kumari M, Ridham F, et al. Deep learning for early detection of cardiovascular diseases from medical imaging. Health Sci Rep. 2025;8(10):e71334. doi:10.1002/hsr2.71334. [Google Scholar] [PubMed] [CrossRef]

29. Salatzki J, Condurache DG, D’Angelo S, Salih AM, Szabo L, Mahmood A, et al. Rheumatoid arthritis and cardiovascular disease associations in the UK Biobank. BMC Med. 2025;23(1):605. doi:10.1093/eurheartj/ehaf784.4210. [Google Scholar] [CrossRef]

30. Singh G, Darji AD, Sarvaiya JN, Patnaik S. Preprocessing and frame level classification framework for cardiac phase detection in 2D echocardiography. Biomed Signal Process Control. 2025;107:107803. [Google Scholar]

31. Yu T, Chen K. Enhancing cardiac disease detection via a fusion of machine learning and medical imaging. Sci Rep. 2025;15(1):26269. doi:10.1038/s41598-025-12030-6. [Google Scholar] [PubMed] [CrossRef]

32. Jabbar MK, Jianjun H, Jabbar A, Rehman ZU. Mamba-based VoxelMorph framework for cardiovascular disease imaging and risk assessment. IEEE Access. 2025;13:78120–37. doi:10.1109/access.2025.3564962. [Google Scholar] [CrossRef]

33. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Berlin/Heidelberg, Germany: Springer; 2015. p. 234–41. [Google Scholar]

34. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023 Oct 2–3; Paris, France. p. 4015–26. [Google Scholar]

35. Patra R, Dutta S, Roy IK, Basak P, Ghosh A. Heart disease detection using vision-based transformer ensemble models. Procedia Comput Sci. 2025;258:3554–69. doi:10.1016/j.procs.2025.04.611. [Google Scholar] [CrossRef]

36. Alsayat A, Mahmoud AA, Alanazi S, Mostafa AM, Alshammari N, Alrowaily MA, et al. Enhancing cardiac diagnostics: a deep learning ensemble approach for precise ECG image classification. J Big Data. 2025;12(1):7. [Google Scholar]

Cite This Article

APA Style

Atteia, G., Altamimi, A., Abuzinadah, N., Alnowaiser, K., Umer, M. et al. (2026). Explainable Segmentation-Guided Mamba-Transformer Framework for Automated Cardiovascular Disease Detection. Computer Modeling in Engineering & Sciences, 147(1), 43. https://doi.org/10.32604/cmes.2026.078510

Vancouver Style

Atteia G, Altamimi A, Abuzinadah N, Alnowaiser K, Umer M, Nam Y, et al. Explainable Segmentation-Guided Mamba-Transformer Framework for Automated Cardiovascular Disease Detection. Comput Model Eng Sci. 2026;147(1):43. https://doi.org/10.32604/cmes.2026.078510

IEEE Style

G. Atteia et al., “Explainable Segmentation-Guided Mamba-Transformer Framework for Automated Cardiovascular Disease Detection,” Comput. Model. Eng. Sci., vol. 147, no. 1, pp. 43, 2026. https://doi.org/10.32604/cmes.2026.078510

BibTex EndNote RIS

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table of Content

Explainable Segmentation-Guided Mamba-Transformer Framework for Automated Cardiovascular Disease Detection

Abstract

Keywords

References

Cite This Article

245

107

0

Related articles

Further Information

Guidelines

Follow Us

Join Us

Contact Us

WhatsApp:

Share Link