Open Access
ARTICLE
AMVT-NMN: Adaptive Multi-Scale Vision Transformer with Neuromorphic Memory Networks for Enhanced Lung Cancer Detection
1 School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin, China
2 College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
3 Interdisciplinary Research Centre for Finance and Digital Economy, King Fahd University of Petroleum and Minerals, Dharan, Saudi Arabia
4 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
5 EIAS Data Science & Blockchain Laboratory, College of Computer and Information Science, Prince Sultan University, Riyadh, Saudi Arabia
6 Department of Computer Science, Ethiopian Defence University, Bishoftu, Ethiopia
7 Department of Electrical and Electronics Engineering, Pusan National University, Busan, Republic of Korea
* Corresponding Authors: Kashish Ara Shakil. Email: ; Longwen Wu. Email:
(This article belongs to the Special Issue: Advanced Computational Intelligence Techniques, Uncertain Knowledge Processing and Multi-Attribute Group Decision-Making Methods Applied in Modeling of Medical Diagnosis and Prognosis)
Computer Modeling in Engineering & Sciences 2026, 147(1), 42 https://doi.org/10.32604/cmes.2026.080279
Received 05 February 2026; Accepted 25 March 2026; Issue published 27 April 2026
Abstract
Lung cancer accounts for the highest number of cancer deaths globally, underscoring the urgent need for early and precise detection to improve patient outcomes. While deep learning has made remarkable strides in analyzing medical images, current approaches face a fundamental challenge: they cannot adequately capture both detailed local patterns and broader contextual relationships within lung computed tomography (CT) scans. To address this limitation, we introduce AMVT-NMN (adaptive multi-scale vision transformer with neuromorphic memory networks), which combines three complementary mechanisms. The dynamic adaptive kernel networks component adjusts receptive field sizes based on input characteristics, enabling flexible feature capture across multiple scales. The neuromorphic contextual memory attention module draws inspiration from how human memory systems process information, maintaining a dynamic record of diagnostically relevant patterns to inform current predictions. The hierarchical cross-scale fusion mechanism with learnable weights synthesizes information from different resolution levels through adaptive weighting. Testing on the Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases (IQ-OTHNCCD) dataset demonstrates strong performance: 97.9% accuracy, 96.5% sensitivity, 98.7% specificity, and 99.2% area under the receiver operating characteristic curve (AUC-ROC). These results surpass existing methods such as CNN-GD, which achieved 97.2% accuracy. Notably, the high specificity translates to fewer false alarms, potentially reducing unnecessary biopsies and follow-up imaging, outcomes that matter considerably in clinical practice. Generalization experiments on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI), Lung Nodule Analysis 2016 (LUNA16), and Non-Small Cell Lung Cancer (NSCLC)-Radiomics datasets yielded AUCs of 96.5%, 92.8%, and 97.2%, respectively. Ablation experiments confirm that each architectural element of AMVT-NMN contributes meaningfully to overall performance. Five-fold cross-validation yielded consistent results (97.71 ± 0.57%), indicating reliable performance across different patient subsets. The memory-augmented design shows particular promise for handling diagnostically ambiguous cases. By focusing on pattern recognition and computational intelligence, it helps cope with uncertain information in intelligent diagnosis systems, meeting the growing demand for trusted artificial intelligence (AI) in decision-making.
Lung cancer remains the deadliest form of malignancy worldwide, accounting for about 25% of total cancer deaths and nearly 1.8 million deaths annually [1]. The large gap in survival rates, roughly 60% for localized disease versus a mere 6% for metastatic disease, underscores the criticality of prompt detection [2]. Low-dose computed tomographic screening has been shown to reduce lung cancer mortality by 20%–30% in selected at-risk populations [3]. However, manual reading of CT examinations suffers from several drawbacks: there is considerable variation between observers (kappa values of 0.40–0.65), and a 25%–30% rate of missed detection in early malignancies, particularly those of 5 mm or less in size [4]. In addition, a typical chest CT examination comprises 200–300 image slices, and the resulting daily reading workload can lead to mental fatigue and inaccurate readings. Importantly, a striking 95% of the nodules identified are benign [5].
Although deep learning has significantly advanced medical image analysis, traditional methods have limitations of their own [6]. Lung nodules exhibit considerable morphological variability, ranging from 3 mm round lesions to 30 mm spiculated masses, with subsolid nodules representing a clinically distinct subtype requiring specialized management strategies [7]. Despite achieving 94%–97% accuracy on benchmark datasets, CNNs with fixed receptive fields struggle to simultaneously capture both small subtle nodules and large spiculated lesions within a single feature extraction pathway [8,9]. Moreover, while CNNs are proficient at capturing local patterns, they struggle to model the long-range dependencies needed to distinguish lesions from anatomical look-alikes in distant parts of an image [10]. Vision transformers remedy this weakness with self-attention, but they require 10–100 times more training data and incur computational costs that scale quadratically with resolution on high-resolution medical images [11,12]. Current CNN-transformer models try to integrate the best of both, but do so through hard-coded feature fusion methods that disregard differences in lesion characteristics and cannot draw on previously seen, similar cases [13,14].
Motivated by these constraints and inspired by models of human memory from the neuroscience literature [15,16], we introduce the adaptive multi-scale vision transformer with neuromorphic memory networks (AMVT-NMN). Our proposed method remedies three shortcomings of current models. First, previous models process multiple scales with either fixed-size kernels or parallel streams, whereas the size variation in lung nodules is too large to be modelled with fixed receptive fields [17]. Second, current deep learning models process each image independently, whereas studies have shown that human radiologists outperform standalone algorithms, suggesting that clinical expertise involves contextual reasoning beyond single-image analysis [18]. Third, attention-based feature weighting has demonstrated superior performance over fixed combination strategies in complex classification tasks [19].
This section reviews prior research on lung cancer detection, organized under four topics: CNN-based approaches, vision transformers, hybrid CNN-transformer models, and attention mechanisms.
2.1 CNN-Based Approaches to Lung Nodule Detection
The precise characterization of pulmonary nodules is a cornerstone of thoracic radiology, directly informing essential management steps in major clinical guidelines [7]. A major driving force behind medical imaging techniques has been the prevalence of convolutional neural networks, following the breakthrough in ImageNet classification. Early methods for lung nodule detection relied on hand-crafted feature extraction, such as shape and texture features, combined with conventional machine learning classifiers. Deep learning then enabled end-to-end learning from images without human intervention [6]. ResNet-based models reached 94%–96% accuracy on medical benchmarks by employing skip connections [8], which allow gradients to backpropagate through deeply layered networks. This was followed by DenseNet, which enhanced feature reuse through dense connectivity patterns, reaching 95.8% accuracy [20]. Recent advances have focused on dual models for detecting lung cancer, combining CNNs for hierarchical feature extraction with a multi-scale detection system that handles different sizes and positions of cancerous lesions [21]. These are examples of how multiple models can be combined to tackle some of the difficult issues that accompany diagnosis. However, the limited receptive fields of CNNs prevent them from learning the long-range spatial dependencies needed to distinguish lung nodules from blood vessel patterns or pleural plaques in medical images [10]. Multi-scale designs such as U-Net and feature pyramid networks [17] partly address this problem by processing images at multiple resolutions, but they employ hand-crafted fusion strategies that are not necessarily optimal for medical imaging [22]. Multi-scale dilated convolution networks take this idea a step further by using dilations to enlarge receptive fields, allowing the model to detect features at different scales while maintaining computational efficiency in medical image classification tasks [23]. Subsequent research pursued more advanced fusion schemes, particularly for lung nodule analysis. For 2D images, a dual-kernel CNN with dual feature fusion (DKCDF) [24] has been proposed to improve detection accuracy, and for 3D volumes, a dual feature fusion-based 3D CNN [25] has been implemented to improve classification accuracy. Recent innovations include attention-enhanced detection models that integrate self-attention mechanisms with convolutional architectures for improved nodule recognition [26].
2.2 Vision Transformers in Medical Imaging
Vision transformers brought the self-attention mechanism to computer vision, providing a global receptive field from the first layer onward [11]. The baseline vision transformer (ViT) splits an image into patches and passes them through a sequence of transformer layers built on multi-head self-attention [27]. Medical imaging has adopted ViT for various tasks [12,28–30], such as image-based disease classification and image segmentation. The vanilla ViT is rarely used in medical imaging research because it demands an immense number of training images (on the order of millions), whereas typical medical imaging studies work with far fewer images (tens of thousands), given the cost of data annotation and patient privacy constraints [31]. Another observation from the literature is that the computational complexity of self-attention on high-resolution CT images (512 × 512 or above) restricts its applicability on regular computing hardware [32]. To address this challenge, the Swin Transformer introduced a hierarchical structure and the concept of shifted windows [33,34]. Although this design achieved relative success compared to the original ViT, it still demands many more samples than CNNs.
2.3 Hybrid CNN-Transformer Models
More recent approaches hybridize the local inductive bias of CNNs with the global modelling power of transformers [11,34]. TransUNet uses CNN architectures to extract low-level features and adds global context with a transformer [35]. Swin-Unet applies the Swin Transformer to medical segmentation within a U-shaped architecture [34,36]. CoTr proposes convolutional transformers that fuse convolution and attention [37]. This concept is further developed in MA-UNet, where features from different scales are combined using hybrid attention in a two-stage encoder, illustrating how coarse- and fine-grained feature extraction, coupled with appropriately configured transformers [38], can be beneficial. However, these methods tend to concatenate or sum the feature vectors from parallel branches with fixed operations, even though the optimal fusion strategy depends on lesion characteristics, image quality, and context [14]. Furthermore, they treat all images identically, even though nodules occupy less than 1% of a typical scan volume, which inflates computational cost [39]. Most importantly, these architectures ignore priors from similar, previously seen images [15]. In recent work, an attention-enhanced hybrid architecture combining InceptionNeXt with transformer-based attention mechanisms demonstrated superior performance for lung cancer detection [40].
2.4 Attention Mechanisms and Memory Networks
Attention mechanisms have become ubiquitous in deep learning, enabling models to selectively attend to important regions and filter out irrelevant information. Squeeze-and-Excitation networks introduced channel attention for dynamic feature weighting, improving resilience to occlusion [41]. The convolutional block attention module (CBAM) integrates channel and spatial attention for effective feature representation at lower computational overhead than SE-Net's channel attention alone [42]. More recently, global attention and context encoding approaches have attracted considerable interest in medical image segmentation; they demonstrate the potential of combining multi-scale context encoding with global attention blocks to model long-range semantic relationships while minimizing the loss of spatial detail in deep networks [43]. Nonetheless, these models operate within a single image and lack any external memory that can persistently store patterns for later reference [44]. Memory networks, which originated in natural language processing, use external memory to store and retrieve information [15,45]. In other domains, recent studies apply memory-augmented models to few-shot and continual learning [46,47], but the medical imaging literature scarcely explores this direction. This work seeks to bridge that gap with a neuromorphic memory system [16] that accumulates prototypical patterns from training samples for biologically plausible case-based reasoning, matching trends toward explainable and prototype-based models in medical informatics [45,48].
Although previous work shows substantial progress in identifying lung nodule patterns with CNN architectures, vision transformers, and their combinations [12,29], pivotal shortcomings remain in handling spatial variability, dynamic feature fusion, and contextually informed reasoning. Traditional CNN approaches are still constrained by fixed receptive fields [49], while transformers require large amounts of training data and computational power [31,50]. Hybrid approaches, on the other hand, follow rigid fusion schemes and do not incorporate memory of patterns from previous cases [15]. Failing to accumulate prototypical nodule patterns from previous cases through external memory is a serious drawback, since recalling prior cases is a natural part of decision-making in healthcare settings [15,48]. Notable efforts to deploy attention mechanisms in medical imaging include improvements in lung nodule and cancer detection through enhanced feature extraction and multi-scale processing [26,40].
To extend existing knowledge in domain literature, this paper makes the following contributions:
• A dynamic adaptive kernel network (DAKN) that learns kernel sizes, handling spatially variant receptive fields by adapting to the attributes of input images.
• A neuromorphic contextual memory attention (NCMA) mechanism that maintains an episodic memory bank of typical nodule presentation patterns, enabling biologically inspired case-based reasoning.
• A hierarchical cross-scale fusion with learnable weights (HCSF-LW) that performs attention-guided adaptive fusion instead of conventional fixed fusion.
• A comprehensive evaluation of the proposed model on the IQ-OTHNCCD dataset, recording state-of-the-art performance with improvements in small-nodule detection, false-positive reduction, and interpretability.
• A comprehensive ablation study showing the contribution of each component of the proposed hybrid model and the feasibility of its application in real-world clinical diagnosis.
3.1 IQ-OTHNCCD Lung Cancer Dataset
The IQ-OTHNCCD (Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases) dataset was collected at two oncology centers in Iraq and represents one of the few publicly available lung cancer CT datasets with patient-level annotations.
The IQ-OTHNCCD lung cancer dataset [51], comprising 1097 samples in Joint Photographic Experts Group (JPG) format, is adopted in this study. The data are divided into three diagnostic categories: benign (120), malignant (561), and normal (416) cases. The study is based on 2D axial CT slices stored as 512 × 512 pixel JPG images. Each image carries information on nodule location, nodule type (solid, part-solid, or ground-glass), and malignancy status.
The dataset exhibits considerable variability: 58% of nodules are solid, 27% part-solid, and 15% ground-glass opacities. There is also anatomical variability, with 28% of cases presenting complex surrounding anatomy.
The LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative) dataset [52] comprises 1018 CT scans of 875 patients, collected from multiple sources. The nodules in this dataset were marked by four radiologists and are classified as malignant, benign, or normal based on the areas of the annotated masks. All 875 patients are used in this study, following the same 70/15/15 stratified division as in Section 3.
The LUNA16 dataset [53], introduced in 2016, consists of 888 CT scans, each divided into 10 subsets and stored in MHD format. Each scan has an associated annotations file that includes the coordinates and diameters of annotated nodules in millimetres, accepted by at least 3 out of 4 radiologists. Unlike clinical datasets with histological confirmation, LUNA16 does not provide benign or malignant labels. Therefore, for the purpose of enabling multi-class classification experiments in this work, a size-based proxy labelling scheme was adopted following established clinical risk stratification thresholds from the Fleischner Society guidelines [5]: nodules with diameter ≥6.0 mm were assigned a proxy malignant label, nodules with diameter between 3.0 and 6.0 mm were assigned a proxy benign label, and CT slices without annotated nodules were labelled as normal. It is explicitly acknowledged that this size-based proxy labelling does not represent confirmed histological malignancy and is adopted solely to facilitate experimental evaluation. For cross-validation purposes, all 10 subsets are combined.
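A minimal sketch of this size-based proxy labelling rule (the function name is ours; the handling of sub-3 mm annotations is our assumption, since LUNA16's reference annotations list nodules of at least 3 mm):

```python
def proxy_label(nodule_diameter_mm):
    """Size-based proxy label for a LUNA16 slice, following the
    Fleischner-guideline thresholds described above. None means
    the slice has no annotated nodule."""
    if nodule_diameter_mm is None:
        return "normal"
    if nodule_diameter_mm >= 6.0:
        return "malignant"   # proxy label only, not histologically confirmed
    if nodule_diameter_mm >= 3.0:
        return "benign"      # proxy label only
    return "normal"          # sub-3 mm marks treated as non-nodules (assumption)
```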
The NSCLC-Radiomics dataset [54] contains 422 samples of non-small cell lung cancer, provided as DICOM files from various clinical sites. Since all samples are confirmed NSCLC, the dataset contains only two classes, malignant and normal, with no benign samples, making it a binary classification problem. All 422 samples were divided using the same 70/15/15 splitting approach.
Stratified splitting was performed at the patient level to avoid data leakage between subsets and to ensure a balanced distribution of benign, malignant, and normal samples in every subset. The splitting resulted in training, validation, and test sets containing 768 (8064 images after augmentation), 165 (2304 images), and 164 (1152 images) samples, respectively.
For cross-dataset generalization tests, the same 70/15/15 split ratio is used for the LIDC-IDRI, LUNA16, and NSCLC-Radiomics datasets independently.
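A minimal sketch of such a patient-level stratified 70/15/15 split (scikit-learn; the function and variable names are ours, and we assume one diagnostic label per patient):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def patient_level_split(patient_ids, patient_labels, seed=42):
    """70/15/15 stratified split performed on patients, not slices,
    so that all images from one patient land in exactly one subset."""
    ids = np.asarray(patient_ids)
    labels = np.asarray(patient_labels)
    # first split off a 30% holdout, stratified by diagnosis
    train_ids, hold_ids, _, hold_y = train_test_split(
        ids, labels, test_size=0.30, stratify=labels, random_state=seed)
    # then split the holdout in half: 15% validation, 15% test
    val_ids, test_ids = train_test_split(
        hold_ids, test_size=0.50, stratify=hold_y, random_state=seed)
    return train_ids, val_ids, test_ids
```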
A unified, streamlined preprocessing workflow was developed to guarantee stable model training and good generalization performance across different clinical datasets, as shown in Fig. 1.

Figure 1: The four-stage pre-processing pipeline for the AMVT-NMN model.
The preprocessing of the IQ-OTHNCCD dataset follows a four-stage protocol. It begins with image resizing, where each image is resized to a standard 128 × 128 pixels. Next, contrast-limited adaptive histogram equalization (CLAHE) is applied to enhance the contrast between different tissues, which helps highlight the boundary details of the nodules. The lung regions are then segmented from the rest of the image and extracted using a binary mask, while other regions such as the chest wall are eliminated. Finally, the image is normalized to the range [0, 1] using min-max normalization based on the 1st and 99th percentiles of the pixel intensity distribution, as shown in Eq. (1).
Let the original pixel intensity in the CT image be denoted as $I(x,y)$. The normalized intensity is computed as

$$I_{\text{norm}}(x,y)=\operatorname{clip}\!\left(\frac{I(x,y)-P_{1}}{P_{99}-P_{1}},\,0,\,1\right)\tag{1}$$

where $P_{1}$ and $P_{99}$ are the 1st and 99th percentiles of the pixel intensity distribution.
The LIDC-IDRI, LUNA16, and NSCLC-Radiomics datasets, stored as Digital Imaging and Communications in Medicine/MetaImage Header (DICOM/MHD) files, are preprocessed similarly. First, raw pixel values are converted to Hounsfield Units (HU). Next, intensity windowing is applied with a standard window level of −600 HU and a window width of 1500 HU; this operation suppresses other anatomical structures and allows better visualization of the lung tissue. Finally, the image is normalized to the range [0, 1] using the same percentile-based min-max normalization as in Eq. (1).
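A condensed sketch of both preprocessing paths (OpenCV/NumPy; the CLAHE clip limit and tile size are illustrative assumptions, and the lung-mask segmentation step is omitted for brevity):

```python
import cv2
import numpy as np

def window_hu(raw, slope, intercept, level=-600.0, width=1500.0):
    """Convert raw DICOM pixels to Hounsfield Units, then apply the
    lung window (level -600 HU, width 1500 HU) described above."""
    hu = raw.astype(np.float32) * slope + intercept
    lo, hi = level - width / 2.0, level + width / 2.0
    return np.clip(hu, lo, hi)

def percentile_normalize(img):
    """Eq. (1): min-max normalization between the 1st and 99th
    intensity percentiles, clipped to [0, 1]."""
    p1, p99 = np.percentile(img, [1, 99])
    return np.clip((img - p1) / (p99 - p1 + 1e-8), 0.0, 1.0)

def preprocess_jpg_slice(gray_uint8):
    """IQ-OTHNCCD path: resize, CLAHE contrast enhancement, normalize.
    The binary lung-mask segmentation stage is omitted here."""
    img = cv2.resize(gray_uint8, (128, 128))
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = clahe.apply(img)
    return percentile_normalize(img.astype(np.float32))
```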
In order to increase data variety and ensure that the model remains robust against irrelevant changes, we employed a series of on-the-fly data augmentations during training, as depicted in Fig. 2. The data augmentation is performed by applying spatial perturbation techniques, including rotating images by 90, 180, or 270 degrees, along with horizontal and vertical flipping. Additionally, we applied photometric augmentations involving changes to brightness and contrast. It is noteworthy that data augmentation was performed only during training; the validation and testing data were kept intact.

Figure 2: Example of data augmentation transformations applied during training.
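A minimal sketch of these on-the-fly augmentations using TensorFlow image ops (the brightness and contrast magnitudes are illustrative assumptions, not the paper's exact settings):

```python
import tensorflow as tf

def augment(image):
    """On-the-fly training augmentation: random 90/180/270-degree
    rotation, horizontal/vertical flips, and mild photometric jitter.
    Applied to training batches only; validation/test stay untouched.
    `image` is an HxWxC float tensor in [0, 1]."""
    k = tf.random.uniform([], 0, 4, dtype=tf.int32)   # 0..3 quarter-turns
    image = tf.image.rot90(image, k)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return tf.clip_by_value(image, 0.0, 1.0)
```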
The adaptive multi-scale vision transformer with neuromorphic memory networks (AMVT-NMN) seamlessly merges three newly proposed elements into a single model: dynamic adaptive kernel networks (DAKN) for non-uniform feature extraction, neuromorphic contextual memory attention (NCMA) for reasoning based on episodic memory retrieval, and hierarchical cross-scale fusion with learnable weights (HCSF-LW) for adaptive multi-resolution fusion. The proposed model follows an encoder-decoder architecture with skip connections, processing an input image at different scales while maintaining context. Fig. 3 shows a visual representation of the architecture.

Figure 3: The adaptive multi-scale vision transformer with neuromorphic memory networks.
4.1 Dynamic Adaptive Kernel Networks (DAKN)
Conventional CNNs use a constant receptive field size, usually 3 × 3 or 5 × 5, at all locations. In reality, lung nodules vary enormously in size, calling for receptive fields that differ by location: small subtle nodules demand the granular resolution of 3 × 3 or 5 × 5 kernels, whereas large spiculated lesions require 7 × 7 or 9 × 9 kernels to capture their full extent. Our proposed dynamic adaptive kernel networks learn these sizes adaptively.
DAKN is a three-stage network. In the first stage, a lightweight prediction network estimates the optimal kernel size at each image location by analyzing local image properties (texture entropy, gradient magnitude, and intensity variance) in a 5 × 5 patch. The prediction network is a two-layer convolutional network with 64 and 32 channels, followed by a spatial softmax that yields a probability distribution over the kernel-size candidates K ∈ {3, 5, 7, 9, 11}. In the second stage, dynamic kernels are generated by interpolation: instead of storing independent weight matrices for each kernel size (which is parameter-inefficient), we keep base kernels of size 3 × 3 and 11 × 11 and obtain intermediate sizes by smoothly interpolating between them. This uses 60% fewer parameters while preserving representational capacity.
In the third stage, spatially adaptive convolution is performed with the predicted kernels. For every spatial position $p$, the output is the probability-weighted combination of the candidate-kernel responses:

$$y(p)=\sum_{k\in\{3,5,7,9,11\}}\pi_{k}(p)\,\big(W_{k}\ast x\big)(p)\tag{2}$$

where $\pi_{k}(p)$ is the predicted probability of kernel size $k$ at position $p$, $W_{k}$ is the interpolated kernel of size $k\times k$, and $x$ is the input feature map.
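A simplified Keras sketch of the DAKN idea, gating explicit candidate convolutions with the predicted softmax rather than reproducing the paper's interpolated base kernels (the layer name and exact hyper-parameters are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

class DAKNBlock(layers.Layer):
    """Simplified DAKN sketch: a lightweight gating branch predicts a
    per-pixel softmax over candidate kernel sizes, and the block returns
    the probability-weighted sum of the candidate responses (Eq. (2)).
    The parameter-efficient base-kernel interpolation is omitted."""

    def __init__(self, filters, sizes=(3, 5, 7, 9, 11), **kwargs):
        super().__init__(**kwargs)
        self.branches = [layers.Conv2D(filters, s, padding="same")
                         for s in sizes]
        # two-layer gating network -> one logit map per kernel size
        self.gate = tf.keras.Sequential([
            layers.Conv2D(64, 5, padding="same", activation="relu"),
            layers.Conv2D(32, 1, activation="relu"),
            layers.Conv2D(len(sizes), 1),
        ])

    def call(self, x):
        pi = tf.nn.softmax(self.gate(x), axis=-1)            # (B,H,W,|K|)
        outs = tf.stack([b(x) for b in self.branches], -1)   # (B,H,W,C,|K|)
        return tf.reduce_sum(outs * pi[..., None, :], axis=-1)
```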
4.2 Neuromorphic Contextual Memory Attention (NCMA)
Human radiologists develop their diagnostic ability through exposure to diverse nodule appearances, recalling similar past cases from memory while reading new images. Inspired by accounts of episodic memory in the neuroscience literature, we propose neuromorphic contextual memory attention, which maintains a learnable memory bank of diverse nodule patterns. Unlike traditional attention models, which compute attention within a single image, NCMA attends across images.
The memory bank $M=\{m_{1},\dots,m_{N}\}\in\mathbb{R}^{N\times D}$ stores $N$ prototype vectors of dimension $D$, learned during the training procedure. NCMA operates in four phases.
Phase 1: Local feature representation. Each local feature vector $f$ produced by the encoder is projected into a query $q=W_{q}f\in\mathbb{R}^{D}$ through a learned projection $W_{q}$.
Phase 2: The attention scores between the query and all memory prototypes are computed:

$$a_{i}=\frac{\exp\!\left(q^{\top}m_{i}/\sqrt{D}\right)}{\sum_{j=1}^{N}\exp\!\left(q^{\top}m_{j}/\sqrt{D}\right)},\qquad i=1,\dots,N\tag{3}$$

where $m_{i}$ is the $i$-th memory prototype and $\sqrt{D}$ is the usual scaling factor. Eq. (3) yields a soft-attention distribution over memory prototypes; a larger weight $a_{i}$ indicates that the current query closely resembles the pattern stored in prototype $m_{i}$.
Phase 3: The weighted sum of the memory prototypes is calculated:

$$r=\sum_{i=1}^{N}a_{i}\,m_{i}\tag{4}$$

where $r\in\mathbb{R}^{D}$ is the retrieved memory representation.
Phase 4: Feature augmentation is obtained by adding the retrieved prototypes back to the query features:

$$\tilde{q}=q+W_{o}\,r\tag{5}$$

where $W_{o}$ is a learned output projection. This operation augments local features with contextual evidence retrieved from similar training cases, so that the current prediction can draw on prototypical nodule patterns.
The memory bank is updated during training alongside gradient descent so that the prototypes learn to match the features in the dataset. Importantly, the update rule combines momentum with an exponential moving average with decay rate $\beta$:

$$m_{i}\leftarrow\beta\,m_{i}+(1-\beta)\,\bar{q}_{i}\tag{6}$$

where $\bar{q}_{i}$ is the mean of the query features assigned to prototype $i$ in the current batch.
This facilitates memory consolidation: each prototype evolves gradually through the moving-average process. High values of $\beta$ favor retention of previously acquired knowledge, while lower values allow quicker adaptation to novel patterns.
This improves the ability to identify less common types of nodules through the application of biologically plausible case-based reasoning and the use of accumulated knowledge.
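A compact Keras sketch of NCMA under these definitions (the projection layers and EMA bookkeeping are our simplifications, and the encoder channel width is assumed to equal the prototype dimension D):

```python
import tensorflow as tf
from tensorflow.keras import layers

class NCMA(layers.Layer):
    """Sketch of neuromorphic contextual memory attention: a learnable
    bank of N prototypes is queried with scaled dot-product attention
    (Eqs. (3)-(5)); an exponential moving average consolidates the
    prototypes during training (Eq. (6))."""

    def __init__(self, num_prototypes=256, dim=512, beta=0.99, **kwargs):
        super().__init__(**kwargs)
        self.beta = beta
        self.memory = self.add_weight(
            name="memory", shape=(num_prototypes, dim),
            initializer="glorot_uniform", trainable=False)
        self.q_proj = layers.Dense(dim)
        self.o_proj = layers.Dense(dim)

    def call(self, feats, training=False):
        # feats: (B, HW, D) flattened spatial features used as queries
        q = self.q_proj(feats)
        scale = tf.math.sqrt(tf.cast(tf.shape(q)[-1], tf.float32))
        attn = tf.nn.softmax(
            tf.matmul(q, self.memory, transpose_b=True) / scale, -1)  # Eq. (3)
        retrieved = tf.matmul(attn, self.memory)                      # Eq. (4)
        if training:  # Eq. (6): EMA toward assignment-weighted query means
            qf = tf.reshape(q, [-1, tf.shape(q)[-1]])       # (BHW, D)
            af = tf.reshape(attn, [-1, tf.shape(attn)[-1]]) # (BHW, N)
            mass = tf.reduce_sum(af, axis=0)[:, None] + 1e-6
            q_bar = tf.matmul(af, qf, transpose_a=True) / mass
            self.memory.assign(self.beta * self.memory
                               + (1.0 - self.beta) * q_bar)
        return feats + self.o_proj(retrieved)                         # Eq. (5)
```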
4.3 Hierarchical Cross-Scale Fusion with Learnable Weights (HCSF-LW)
Multiscale feature fusion plays an important role in detecting nodules across different size ranges. Previous methods depend on fixed fusion techniques, such as concatenation and addition, which apply uniform weighting irrespective of lesion type. Our hierarchical cross-scale fusion with learnable weights addresses this limitation with attention-guided adaptive fusion that dynamically weights contributions from different scales.
HCSF-LW processes features at four scales, 1/4, 1/8, 1/16, and 1/32 of the original resolution, to capture information ranging from fine texture to coarse shape. For each adjacent scale pair $(F_{i},F_{j})$, where $F_{i}$ is the finer-resolution map and $F_{j}$ the coarser one, fusion proceeds in three steps.
Step 1—Spatial alignment: the lower-resolution features are up-sampled to match the higher resolution, using bilinear interpolation with learned refinement.
Step 2—Fusion weight prediction: a lightweight network analyzes the concatenated features $[F_{i};\,\mathrm{Up}(F_{j})]$ and predicts a spatial weight map

$$\alpha=\sigma\!\big(g([F_{i};\,\mathrm{Up}(F_{j})])\big)\tag{7}$$

where $g(\cdot)$ is a small convolutional network, $\sigma$ is the sigmoid function, and $\mathrm{Up}(\cdot)$ denotes the learned up-sampling from Step 1. The weight tensor $\alpha$ has the same spatial dimensions as the features, allowing pixel-wise adaptive fusion.
Step 3—Weighted fusion: the aligned features are blended according to the predicted weights:

$$F_{\text{fused}}=\alpha\odot F_{i}+(1-\alpha)\odot\mathrm{Up}(F_{j})\tag{8}$$

where $\odot$ denotes element-wise multiplication.
Features at multiple scales are fused progressively across the hierarchy, bridging neighboring scales before more distant ones. A residual connection then yields the output:

$$F_{\text{out}}=F_{\text{fused}}+F_{i}\tag{9}$$

where the residual path preserves the original fine-scale features and stabilizes training.
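A minimal Keras sketch of one HCSF-LW fusion step for a single scale pair (the weight-network depth and channel counts are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

class HCSFLW(layers.Layer):
    """Sketch of hierarchical cross-scale fusion with learnable weights:
    the coarse map is up-sampled and refined (Step 1), a lightweight
    network predicts a pixel-wise fusion weight (Eq. (7)), and the two
    scales are blended with a residual connection (Eqs. (8)-(9))."""

    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        self.refine = layers.Conv2D(filters, 3, padding="same",
                                    activation="relu")
        self.weight_net = tf.keras.Sequential([
            layers.Conv2D(32, 3, padding="same", activation="relu"),
            layers.Conv2D(1, 1, activation="sigmoid"),  # alpha in [0, 1]
        ])

    def call(self, fine, coarse):
        # Step 1: spatial alignment (bilinear up-sampling + refinement)
        up = tf.image.resize(coarse, tf.shape(fine)[1:3], method="bilinear")
        up = self.refine(up)
        # Step 2: pixel-wise fusion weight from concatenated features
        alpha = self.weight_net(tf.concat([fine, up], axis=-1))
        # Step 3: weighted fusion (Eq. (8)) plus residual (Eq. (9))
        fused = alpha * fine + (1.0 - alpha) * up
        return fused + fine
```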
4.4 Overall Architecture and Training
The AMVT-NMN architecture combines all modules within an encoder-decoder framework. The encoder consists of four DAKN blocks with progressively more channels (64, 128, 256, and 512) and smaller spatial resolutions (1/4, 1/8, 1/16, and 1/32). Each DAKN block applies adaptive kernel convolution, batch normalization, rectified linear unit (ReLU) nonlinearity, and NCMA memory-guided feature extraction. Strided convolution is used between DAKN blocks to downsample the features while retaining spatial information. The decoder mirrors the encoder symmetrically, using transposed convolution for up-sampling. Skip connections pass features from the encoder to the decoder at all scales through the HCSF-LW mechanism.
Training uses a hybrid loss function:

$$\mathcal{L}=\mathcal{L}_{\mathrm{cls}}+\lambda_{1}\,\mathcal{L}_{\mathrm{mem}}+\lambda_{2}\,\mathcal{L}_{\mathrm{fusion}}\tag{10}$$

Here, $\mathcal{L}_{\mathrm{cls}}$ is the primary classification loss, while $\mathcal{L}_{\mathrm{mem}}$ and $\mathcal{L}_{\mathrm{fusion}}$ regularize the memory and fusion modules, with $\lambda_{1}$ and $\lambda_{2}$ balancing the terms. The hybrid loss jointly optimizes classification accuracy while regularizing the novel components, ensuring that the auxiliary modules learn meaningful, task-relevant representations without degrading core detection performance.
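A sketch of how such a hybrid objective might be assembled (the regularizer terms and the weights lam1/lam2 are placeholders; their exact form and values are not reported here):

```python
import tensorflow as tf

def hybrid_loss(y_true, y_pred, mem_reg, fusion_reg,
                lam1=0.1, lam2=0.05):
    """Sketch of the hybrid objective (Eq. (10)): cross-entropy for
    classification plus scalar regularizers supplied by the memory
    and fusion modules. lam1/lam2 are illustrative, not the paper's."""
    cls = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    return tf.reduce_mean(cls) + lam1 * mem_reg + lam2 * fusion_reg
```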
We train the AMVT-NMN model in Keras with a TensorFlow backend on an NVIDIA RTX 4090 GPU with 24 GB of memory. We use Kaiming initialization for convolutional layers and Xavier initialization for attention layers. The memory bank prototypes are initialized by k-means clustering with k = 256 on training-set features extracted by an early encoder model. We use mixed-precision (FP16) training to speed up computation, with loss scaling to preserve stability. Training takes around 48 h for 150 epochs. The proposed AMVT-NMN model has 0.50 million trainable parameters, as shown in Table 1, and requires approximately 20.90 GFLOPs per forward pass. Average inference time per image on the NVIDIA RTX 4090 GPU is 7.90 ± 1.00 ms, equating to approximately 131 images per second at a batch size of 16. Peak GPU memory usage during inference is approximately 152 MB.
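The k-means prototype initialization described above can be sketched as follows (scikit-learn; the function name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_memory_bank(encoder_features, k=256, seed=0):
    """Initialize the NCMA memory bank with k-means centroids of
    encoder features extracted from the training set."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(encoder_features)          # (num_samples, D) feature matrix
    return km.cluster_centers_.astype(np.float32)   # (k, D) prototypes
```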

We evaluate our results using clinically meaningful metrics. For classification, we report accuracy, sensitivity (recall), specificity, precision, F1-score, and AUC-ROC. To assess practicality in real-world clinical settings, false-positive predictions are analyzed critically, as they are a major cause of unnecessary testing and patient distress. The efficiency of the algorithm is also examined in terms of its parameter count (in millions) to determine its feasibility for clinical deployment.
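A sketch of these metrics for a multi-class problem (scikit-learn; the one-vs-rest macro-averaging mirrors the macro-averaged AUC reported below):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def clinical_metrics(y_true, y_prob, positive_class=1):
    """Sensitivity/specificity for one class from the confusion matrix,
    plus macro-averaged one-vs-rest AUC-ROC over all classes."""
    y_pred = np.argmax(y_prob, axis=1)
    cm = confusion_matrix(y_true, y_pred)
    tp = cm[positive_class, positive_class]
    fn = cm[positive_class].sum() - tp
    fp = cm[:, positive_class].sum() - tp
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    return sensitivity, specificity, auc
```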
The AMVT-NMN model outperforms existing baselines on all metrics on the IQ-OTHNCCD test set, as shown in Table 2.
The visualizations in Figs. 4–10 show that AMVT-NMN surpasses prior 2D lung cancer image analysis methods on all reported metrics, achieving 97.9% accuracy, 96.5% sensitivity, 98.7% specificity, 98.2% precision, 97.3% F1-score, and 99.2% AUC. Note that the 99.2% AUC reported in Table 2 is the macro-averaged AUC on the test set, whereas 99.6% is the corresponding value on the validation set for the full AMVT-NMN model; the 0.4% difference simply reflects the different evaluation sets.

Figure 4: Confusion matrix of our method from the training.

Figure 5: Performance of each class in terms of precision, recall, F1-score and AUC.

Figure 6: Training performance curves in terms of accuracy, recall, precision and loss.

Figure 7: ROC curves per class.

Figure 8: Precision-recall curves per class.

Figure 9: Confidence distribution for each class.

Figure 10: Calibration curves for each class.
The high specificity of 98.7% is a key differentiator: it effectively reduces the number of false positives, minimizing unnecessary follow-up tests and patient anxiety, which is an important aspect of practical deployment.
Table 2 also shows per-class performance. For the class where accuracy matters most, malignancy, AMVT-NMN reaches 96.5% sensitivity, outperforming DenseNet121-SVM (90.3%), ResNet50 (93.0%), and DCSwinB (90.6%) by margins of 3.5% to 6.2%. For benign and normal classes, every model exceeds 95% sensitivity, suggesting that what differentiates architectures is the detection of nuanced signs of malignancy. This is where AMVT-NMN's advantage originates: the adaptive receptive fields of DAKN preserve textural nuances, and the memory component of NCMA emphasizes typical malignancy features.
To assess the contribution of each component, we conduct ablation experiments that remove one component at a time, starting from the full AMVT-NMN design. The results are presented in Table 3.

The full AMVT-NMN design attains 97.9% accuracy. Removing DAKN causes the largest drop, with accuracy falling to 83.4% (down 14.5%), underscoring the importance of DAKN's adaptive multi-scale feature extraction. Removing NCMA reduces accuracy to 84.2% (down 13.7%), confirming the value of its memory-based contextual attention. Removing HCSF-LW drops accuracy to 92.3% (down 5.6%).
The method is further evaluated with five-fold cross-validation. As presented in Table 4, the results remain stable, with an average accuracy of 97.71 ± 0.57% (ranging from 96.67% to 98.21%). The small standard deviation (σ = 0.0057) confirms consistent performance across folds.

6.4 Cross-Dataset Generalization Experiments
To verify that AMVT-NMN generalizes beyond the main IQ-OTHNCCD dataset, we also trained and evaluated the same architecture on three other lung cancer datasets with different sources and image acquisition protocols: LIDC-IDRI, LUNA16, and NSCLC-Radiomics. The results are presented in Table 5 and Fig. 11. In these experiments, the network keeps the structure described in Section 4; only the datasets change.


Figure 11: A cross-dataset comparison.
As shown in Table 5, our model reaches 86.7% accuracy on LIDC-IDRI, which contains data from 875 patients across multiple institutions with multi-reader nodule annotations. AMVT-NMN attains 76.3% accuracy on LUNA16, which comprises 888 volumetric MHD CT scans posed here as a three-class problem, and 91.5% accuracy on NSCLC-Radiomics, which contains 422 non-small cell lung cancer cases in DICOM format. Note that NSCLC-Radiomics contains only malignant and normal images; no benign class is present.
Furthermore, as Table 5 shows, AMVT-NMN attains high AUC values on all datasets: 99.2% on IQ-OTHNCCD, 97.2% on NSCLC-Radiomics, 96.5% on LIDC-IDRI, and 92.8% on LUNA16. These results provide empirical evidence of AMVT-NMN's generalizability to heterogeneous data from different sources, acquisition protocols, imaging equipment, and image formats, making it a promising model for real-world applications.
To analyze how NCMA uncovers patterns in memory, we investigated the memory prototypes that NCMA learned. Applying t-distributed stochastic neighbor embedding (t-SNE) to the 256-prototype memory bank (Fig. 12) reveals distinct clusters; we identified five natural prototype groups: Cluster 1 (28.1%), Cluster 2 (22.7%), Cluster 3 (19.1%), Cluster 4 (16.4%), and Cluster 5 (13.7%).

Figure 12: t-SNE visualization of the learned memory prototypes.
A quantitative examination of the activation maps reveals class-specific preferences. Malignant samples elicit stronger average activation weights (0.74 ± 0.12) than benign (0.68 ± 0.15) and normal samples (0.42 ± 0.18), demonstrating that the system learns to distinguish classes without any explicit labeling of texture or shape features.
On the minority subset of the validation set (28.1% of the data), NCMA maintains a high accuracy of 94.7%, validating the use of learned prototype patterns for under-represented cases.
Error analysis (Figs. 13–15) reveals the following key observations. First, the malignant class shows the lowest error rate (0.4%) and the highest accuracy (99.6%), establishing the model's reliability on the most critical class. Second, most errors occur between the normal and benign classes, owing to their structural similarity, with error rates of 3.6% and 2.6%, respectively. Third, the confidence analysis shows that correct predictions carry an average confidence of 0.987 ± 0.025, whereas incorrect ones average 0.859 ± 0.118. This suggests a practical 95% confidence threshold for clinical use: predictions above this threshold reach 97.7% accuracy, while lower-confidence predictions have markedly higher error rates and should be referred to a radiologist.
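A minimal sketch of the confidence-based triage rule implied by this analysis (the threshold value and names are illustrative):

```python
import numpy as np

def triage(y_prob, threshold=0.95):
    """Accept predictions whose maximum softmax confidence reaches the
    threshold; route the rest to radiologist review."""
    conf = y_prob.max(axis=1)
    auto = conf >= threshold      # auto-reported predictions
    review = ~auto                # flagged for human review
    return auto, review
```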

Figure 13: Error analysis per class.

Figure 14: Confidence distribution of error analysis.

Figure 15: Misclassified image from error analysis.
6.7 Attention Visualization and Interpretability
NCMA attention analysis: the attention maps reveal clear diagnostic cues. The normal tissue example in Fig. 16 attracts the highest attention (mean approximately 0.723, concentration around 60.5%), indicating that the model consistently focuses on healthy lung parenchyma. Benign cases reach a moderate attention score of 0.644 with a higher spread (±0.090) and a concentration of around 49.5% (±19.6%), reflecting the variety of lesion shapes and morphologies. In malignant cases, attention is focused and stable around tumor regions and their irregular margins.

Figure 16: Attention maps that drive region predictions.
DAKN kernel selection analysis: kernel choices are consistently multi-scale across all groups. The model favors coarse 7 × 7 kernels to capture structural context, fine 3 × 3 kernels to pick up texture details, and medium 5 × 5 kernels for transitional features. The near-uniform selection across pathology types (variation below 5%) implies that the discriminative power comes from combining hierarchical multi-scale features rather than tuning to any single class, indicating that DAKN learns generalizable representations applicable to a range of lung tissue pathologies.
AMVT-NMN performs well in classifying 2D CT slices, with 97.9% accuracy and 96.5% sensitivity, and can therefore be considered a potential tool for computer-aided diagnosis (CAD). However, claims regarding early-stage detection and stage migration would need to be verified on 3D volumetric images.
By presenting graphical visualizations of the model's attention regions and the memory retrieval of related cases, the system builds credibility and trust among healthcare practitioners. Radiologists can inspect the concrete regions the model attended to when analyzing a specific case, and thus understand its conclusion about the malignancy of a suspicious lesion. The model's strategy is AI-human collaboration that amplifies human capabilities rather than replacing them, which is ethically sound within AI-assisted healthcare practice.
7.2 Limitations and Future Work
The promising results of AMVT-NMN should be viewed with the following six caveats in mind.
First, AMVT-NMN operates on 2D axial slices rather than 3D volumes. As noted in the introduction, lung nodules are inherently three-dimensional, and the spatial information in isolated slices is incomplete. Models such as 3D CNN-RNNs can exploit the spatial information in full volumes, so a direct comparison between AMVT-NMN and 3D models is not entirely fair given the different input dimensionality. Extending AMVT-NMN to 3D volumes is an interesting direction for future research, though it would substantially increase computational cost.
Second, the generalizability of AMVT-NMN is validated on four datasets, IQ-OTHNCCD, LIDC-IDRI, LUNA16, and NSCLC-Radiomics, each generated at different hospitals and covering diverse imaging protocols and patient populations. However, all experiments remain retrospective.
Third, the size of our memory bank (256 prototypes) reflects a capacity-efficiency trade-off; larger banks can capture finer-grained patterns at greater retrieval cost. Fourth, although the attention maps are interpretable, they are post-hoc explanations rather than a guarantee of sound reasoning; a more comprehensive evaluation of explanation quality through radiologist assessment would strengthen the results. Fifth, clinical translation requires prospective validation in routine clinical environments with varying patient demographics and scanner types, along with seamless integration into current picture archiving and communication systems. Sixth, compared with a lightweight CNN, the proposed multi-scale transformer architecture combined with the neuromorphic memory module moderately increases computational cost, reaching 20.90 GFLOPs per image, computed in approximately 7.90 ms on an NVIDIA RTX 4090. This is acceptable for a clinical workstation but may require additional optimization on less well-equipped platforms. Future work will explore model compression techniques such as knowledge distillation, structured pruning, and quantization-aware training to reduce computational cost without compromising diagnostic accuracy.
Some potential future research directions include: (1) Multi-task learning framework that detects nodules, segments their boundaries, and predicts malignancy risk in an end-to-end fashion; (2) Temporal modelling for longitudinal analysis across serial CT examinations; (3) Uncertainty quantification in order to provide confidence estimates for clinical decision-making; (4) Integration of clinical data, such as age, smoking status, and biomarkers, for holistic risk estimation; and (5) Few-shot learning capabilities to rapidly generalize to rare subtypes of nodules or new institutions with limited labelled data.
This paper introduced AMVT-NMN, a new deep learning algorithm for diagnosing lung cancer that combines adaptive processing and a neuromorphic memory component to improve medical image analysis. Its three innovations are dynamic adaptive kernel networks for spatially adaptive feature extraction, neuromorphic contextual memory attention for episodic-memory-based reasoning, and hierarchical cross-scale fusion with learnable weights for adaptive multiscale integration. These innovations address three fundamental issues with the hybrid CNN-transformer models used in prior approaches to medical image analysis. On the IQ-OTHNCCD dataset, the model achieves 97.9% accuracy and 96.5% sensitivity on challenging nodules.
Beyond its performance gains over existing baselines, AMVT-NMN contributes attention visualization, memory-recall inspection, and adaptive kernel mapping to transparent medical AI research. The system follows biologically plausible principles guided by human diagnostic logic and visual memory. The ablation experiments demonstrated the effectiveness of every component in the hybrid design, as well as the interplay between adaptive kernel learning, memory attention, and fusion. The performance improvement on rare nodules shows the effectiveness of memory-driven reasoning in correcting clinical data imbalance.
The research has a wide range of clinical applications, particularly early nodule detection, which enables curative treatment; pending prospective validation, it has the potential to contribute to diagnostic pathways. Real-time processing and interpretable outputs enable translation to the clinic as a supportive diagnostic system that complements a radiologist's judgment. The adaptive processing and memory mechanisms described above are applicable to areas of medical imaging beyond lung cancer diagnosis, serving as a blueprint for future image diagnostic systems. Cross-dataset validation on LIDC-IDRI, LUNA16, and NSCLC-Radiomics demonstrated good generalization, with AUCs ranging from 92.8% to 97.2%, showing the model's potential across varied clinical scenarios. Future work will concentrate on three areas: extension to 3D volumes, validation across multiple institutions, and clinical trials.
Acknowledgement: The authors would like to specially acknowledge the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R757), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The National Natural Science Foundation of China Grant numbers 62071153, 62571163, and 62571167 are acknowledged for support to the first author, Wariyo Godana Arero.
Funding Statement: This work was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R757), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia and the National Natural Science Foundation of China under Grant numbers 62071153, Grant 62571163 and Grant 62571167.
Author Contributions: Wariyo Godana Arero: Conceptualization, methodology, formal analysis, and writing original draft preparation. Yaqin Zhao: Conceptualization, resources, funding acquisition, and supervision. Longwen Wu: Investigation, writing review, and editing. Mudasir Ahmad Wani, Sadique Ahmad, Pir Noman Ahmad, Kashish Ara Shakil, Sidrak Habtemariam Teredda and Merhawit Berhane Teklu were responsible for review, editing and data collection. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: Data openly available in a public repository. The data used in this study has been appropriately cited within the manuscript. The data used in our study are listed here: (1) IQ-OTHNCCD dataset The IQ-OTH/NCCD lung cancer dataset—Mendeley Data. (2) LIDC-IDRI dataset https://www.kaggle.com/datasets/washingtongold/lidcidri30. (3) LUNA16 dataset https://luna16.grand-challenge.org/Data/. (4) NSCLC-Radiomics dataset https://www.kaggle.com/datasets/umutkrdrms/nsclc-radiomics.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2024;74(3):229–63. doi:10.3322/caac.21834. [Google Scholar] [PubMed] [CrossRef]
2. Siegel RL, Kratzer TB, Giaquinto AN, Sung H, Jemal A. Cancer statistics, 2025. CA Cancer J Clin. 2025;75(1):10–45. doi:10.3322/caac.21871. [Google Scholar] [PubMed] [CrossRef]
3. The National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011;365(5):395–409. doi:10.1056/nejmoa1102873. [Google Scholar] [PubMed] [CrossRef]
4. Gould MK, Tang T, Liu IA, Lee J, Zheng C, Danforth KN, et al. Recent trends in the identification of incidental pulmonary nodules. Am J Respir Crit Care Med. 2015;192(10):1208–14. doi:10.1164/rccm.201505-0990OC. [Google Scholar] [PubMed] [CrossRef]
5. MacMahon H, Naidich DP, Goo JM, Lee KS, Leung ANC, Mayo JR, et al. Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner society 2017. Radiology. 2017;284(1):228–43. doi:10.1148/radiol.2017161659. [Google Scholar] [PubMed] [CrossRef]
6. Suzuki K. Overview of deep learning in medical imaging. Radiol Phys Technol. 2017;10(3):257–73. doi:10.1007/s12194-017-0406-5. [Google Scholar] [PubMed] [CrossRef]
7. Chen H, Kim AW, Hsin M, Shrager JB, Prosper AE, Wahidi MM, et al. The 2023 American association for thoracic surgery (AATS) expert consensus document: management of subsolid lung nodules. J Thorac Cardiovasc Surg. 2024;168(3):631–47.e11. doi:10.1016/j.jtcvs.2024.02.026. [Google Scholar] [PubMed] [CrossRef]
8. Shafiq M, Gu Z. Deep residual learning for image recognition: a survey. Appl Sci. 2022;12(18):8972. doi:10.3390/app12188972. [Google Scholar] [CrossRef]
9. Tan M, Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning PMLR; 2019 Jun 9–15; Long Beach, CA, USA. p. 6105–14. [Google Scholar]
10. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:1–11. doi:10.65215/pc26a033. [Google Scholar] [CrossRef]
11. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In: International Conference on Learning Representations; 2021 May 3–7; Virtual. [Google Scholar]
12. Shamshad F, Khan S, Zamir SW, Khan MH, Hayat M, Khan FS, et al. Transformers in medical imaging: a survey. Med Image Anal. 2023;88(1):102802. doi:10.1016/j.media.2023.102802. [Google Scholar] [PubMed] [CrossRef]
13. Wang H, Cao P, Wang J, Zaiane OR. UCTransNet: rethinking the skip connections in U-Net from a channel-wise perspective with transformer. Proc AAAI Conf Artif Intell. 2022;36(3):2441–9. doi:10.1609/aaai.v36i3.20144. [Google Scholar] [CrossRef]
14. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imag. 2020;39(6):1856–67. doi:10.1109/TMI.2019.2959609. [Google Scholar] [PubMed] [CrossRef]
15. Santoro A, Bartunov S, Botvinick M, Wierstra D, Lillicrap T. Meta-learning with memory-augmented neural networks. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning; 2016 Jun 19–24; New York, NY, USA. p. 1842–50. [Google Scholar]
16. Kumaran D, Hassabis D, McClelland JL. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn Sci. 2016;20(7):512–34. doi:10.1016/j.tics.2016.05.004. [Google Scholar] [PubMed] [CrossRef]
17. Siddique N, Paheding S, Elkin CP, Devabhaktuni V. U-Net and its variants for medical image segmentation: a review of theory and applications. IEEE Access. 2021;9:82031–57. doi:10.1109/ACCESS.2021.3086020. [Google Scholar] [CrossRef]
18. Hwang EJ, Nam JG, Lim WH, Park SJ, Jeong YS, Kang JH, et al. Deep learning for chest radiograph diagnosis in the emergency department. Radiology. 2019;293(3):573–80. doi:10.1148/radiol.2019191225. [Google Scholar] [PubMed] [CrossRef]
19. Xu B, Liu J, Hou X, Liu B, Garibaldi J, Ellis IO, et al. Attention by selection: a deep selective attention approach to breast cancer classification. IEEE Trans Med Imaging. 2019;39(6):1930–41. doi:10.1109/TMI.2019.2962013. [Google Scholar] [PubMed] [CrossRef]
20. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA. p. 2261–9. doi:10.1109/CVPR.2017.243. [Google Scholar] [CrossRef]
21. Elhassan SM, Darwish SM, Elkaffas SM. An enhanced lung cancer detection approach using dual-model deep learning technique. Comput Model Eng Sci. 2025;142(1):835–67. doi:10.32604/cmes.2024.058770. [Google Scholar] [CrossRef]
22. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA. p. 2117–25. [Google Scholar]
23. Singh A, Athisayamani S, Joshi G, Shrestha B. Multi-scale dilated convolution network for SPECT-MPI cardiovascular disease classification with adaptive denoising and attenuation correction. Comput Model Eng Sci. 2024;142(1):299–327. doi:10.32604/cmes.2024.055599. [Google Scholar] [CrossRef]
24. Arero W, Zhao Y, Wu L, Wang Y. DKCDF: dual-kernel CNN with dual feature fusion for lung cancer detection. In: Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies; 2024 Feb 21–23; Rome, Italy. p. 54–64. doi:10.5220/0012406100003657. [Google Scholar] [CrossRef]
25. Arero WG, Zhao Y, Wu L, Sori WJ. 3D CNN lung cancer classification with dual feature fusion. In: 2025 IEEE 23rd International Conference on Industrial Informatics (INDIN); 2025 Jul 12–15; Kunming, China. p. 1–6. doi:10.1109/INDIN64977.2025.11278981. [Google Scholar] [CrossRef]
26. Lu Z, Liu F, Wang L, Xu L, Liu X. A novel lung nodule detection and recognition model based on deep learning. IEEE Access. 2024;12(8):155990–6002. doi:10.1109/ACCESS.2024.3478358. [Google Scholar] [CrossRef]
27. Mkindu H, Wu L, Zhao Y. Lung nodule detection in chest CT images based on vision transformer network with Bayesian optimization. Biomed Signal Process Control. 2023;85(1):104866. doi:10.1016/j.bspc.2023.104866. [Google Scholar] [CrossRef]
28. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jegou H. Training data-efficient image transformers & distillation through attention. In: Meila M, Zhang T, editors. Proceedings of the 38th International Conference on Machine Learning; 2021 Jul 18–24; Virtual. p. 10347–57. [Google Scholar]
29. Zhou SK, Greenspan H, Davatzikos C, Duncan JS, Van Ginneken B, Madabhushi A, et al. A review of deep learning in medical imaging: imaging traits, technology trends, case studies with progress highlights, and future promises. Proc IEEE. 2021;109(5):820–38. doi:10.1109/JPROC.2021.3054390. [Google Scholar] [PubMed] [CrossRef]
30. Mkindu H, Wu L, Zhao Y. 3D multi-scale vision transformer for lung nodule detection in chest CT images. Signal Image Video Process. 2023;17(5):2473–80. doi:10.1007/s11760-022-02464-0. [Google Scholar] [CrossRef]
31. He X, Yang X, Zhang S, Zhao J, Zhang Y, Xing EP, et al. Sample-efficient deep learning for COVID-19 diagnosis based on CT scans. medRxiv. 2020. doi:10.1101/2020.04.13.20063941. [Google Scholar] [CrossRef]
32. Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, et al. CSWin transformer: a general vision transformer backbone with cross-shaped windows. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–24; New Orleans, LA, USA. p. 12114–24. doi:10.1109/CVPR52688.2022.01181. [Google Scholar] [CrossRef]
33. Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, et al. Swin transformer V2: scaling up capacity and resolution. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–24; New Orleans, LA, USA. p. 11999–2009. doi:10.1109/CVPR52688.2022.01170. [Google Scholar] [CrossRef]
34. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10–17; Montreal, QC, Canada. p. 9992–10002. doi:10.1109/ICCV48922.2021.00986. [Google Scholar] [CrossRef]
35. Chen J, Mei J, Li X, Lu Y, Yu Q, Wei Q, et al. TransUNet: rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med Image Anal. 2024;97(2):103280. doi:10.1016/j.media.2024.103280. [Google Scholar] [PubMed] [CrossRef]
36. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In: Karlinsky L, Michaeli T, Nishino K, editors. Computer Vision—ECCV 2022 Workshops; 2022 Oct 23–27; Tel Aviv, Israel. p. 205–18. doi:10.1007/978-3-031-25066-8_9. [Google Scholar] [CrossRef]
37. Xie Y, Zhang J, Shen C, Xia Y. CoTr: efficiently bridging CNN and transformer for 3D medical image segmentation. In: de Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, et al., editors. Medical Image Computing and Computer Assisted Intervention—MICCAI 2021; 2021 Sep 27–Oct 1; Strasbourg, France. p. 171–80. doi:10.1007/978-3-030-87199-4_16. [Google Scholar] [CrossRef]
38. Li H, Ren Z, Zhu G, Liang Y, Cui H, Wang C, et al. Enhancing medical image segmentation with MA-UNet: a multi-scale attention framework. Vis Comput. 2025;41(8):6103–20. doi:10.1007/s00371-024-03774-9. [Google Scholar] [CrossRef]
39. Bria A, Marrocco C, Tortorella F. Addressing class imbalance in deep learning for small lesion detection on medical images. Comput Biol Med. 2020;120(132):103735. doi:10.1016/j.compbiomed.2020.103735. [Google Scholar] [PubMed] [CrossRef]
40. Ozdemir B, Aslan E, Pacal I. Attention enhanced InceptionNeXt-based hybrid deep learning model for lung cancer detection. IEEE Access. 2025;13:27050–69. doi:10.1109/ACCESS.2025.3539122. [Google Scholar] [PubMed] [CrossRef]
41. Narayanan M. SENetV2: aggregated dense layer for channelwise and global representations. arXiv:2311.10807. 2023. doi:10.48550/arXiv.2311.10807. [Google Scholar] [CrossRef]
42. Woo S, Park J, Lee J-Y, Kweon IS. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018 Sep 8–14; Munich, Germany. p. 3–19. [Google Scholar]
43. Wu S, Guo Y, Qian C, Li Y, Zhang X. Global attention and context encoding for enhanced medical image segmentation. Vis Comput. 2025;41(10):7781–98. doi:10.1007/s00371-025-03838-4. [Google Scholar] [CrossRef]
44. Omidi P, Huang X, Laborieux A, Nikpour B, Shi T, Eshaghi A. Memory-augmented transformers: a systematic review from neuroscience principles to enhanced model architectures. arXiv:2508.10824. 2025. doi:10.48550/arXiv.2508.10824. [Google Scholar] [CrossRef]
45. Gallée L, Lisson CS, Lisson CG, Drees D, Weig F, Vogele D, et al. Evaluating the explainability of attributes and prototypes for a medical classification model. In: Explainable Artificial Intelligence; 2024 Jul 17–19; Valletta, Malta. doi:10.1007/978-3-658-47422-5_17. [Google Scholar] [CrossRef]
46. Chen L, Bentley P, Mori K, Misawa K, Fujiwara M, Rueckert D. Self-supervised learning for medical image analysis using image context restoration. Med Image Anal. 2019;58(11):101539. doi:10.1016/j.media.2019.101539. [Google Scholar] [PubMed] [CrossRef]
47. Xu J, Glicksberg BS, Su C, Walker P, Bian J, Wang F. Federated learning for healthcare informatics. J Healthc Inform Res. 2021;5(1):1–19. doi:10.1007/s41666-020-00082-4. [Google Scholar] [PubMed] [CrossRef]
48. Chen C, Li O, Tao C, Barnett AJ, Su J, Rudin C. This looks like that: deep learning for interpretable image recognition. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems; 2019 Dec 8–14. Vancouver, BC, Canada. Red Hook, NY, USA: Curran Associates Inc.; 2019. p. 8930–41. [Google Scholar]
49. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42(13):60–88. doi:10.1016/j.media.2017.07.005. [Google Scholar] [PubMed] [CrossRef]
50. Zhu X, Su W, Lu L, Li B, Wang X, Dai J, et al. Deformable DETR: deformable transformers for end-to-end object detection. In: International Conference on Learning Representations ICLR 2021; 2021 May 3–7; Virtual. p. 1–16. [Google Scholar]
51. Alyasriy H, AL-Huseiny M. The IQ-OTH/NCCD lung cancer dataset. Mendeley Data. 2023;4. doi:10.17632/bhmdr45bh2.4. [Google Scholar] [CrossRef]
52. Mader KS. The lung image database consortium image collection (LIDC-IDRI). IEEE DataPort. 2021. doi:10.21227/ZCE3-JP96. [Google Scholar] [CrossRef]
53. Wang NW. LUNA 16. IEEE DataPort. 2025. doi:10.21227/0KJP-G187. [Google Scholar] [CrossRef]
54. Aerts HJWL, Rios Velazquez E, Leijenaar RTH, Parmar C, Grossmann P, Carvalho S, et al. NSCLC-radiomics-genomics. 2015. doi:10.7937/K9/TCIA.2015.L4FRET6Z. [Google Scholar] [CrossRef]
55. Saied M, Raafat M, Yehia S, Khalil MM. Efficient pulmonary nodules classification using radiomics and different artificial intelligence strategies. Insights Imag. 2023;14(1):91. doi:10.1186/s13244-023-01441-6. [Google Scholar] [PubMed] [CrossRef]
56. Faizi MK, Qiang Y, Wei Y, Qiao Y, Zhao J, Aftab R, et al. Deep learning-based lung cancer classification of CT images. BMC Cancer. 2025;25(1):1056. doi:10.1186/s12885-025-14320-8. [Google Scholar] [PubMed] [CrossRef]
57. Shilpa, Kaur K, Kumar N, Sharma P. Comparison of accuracy in lung cancer classification through convolutional neural network models based on histopathological image. In: Madhavi KR, Ramrao N, Kumar K, Raju KS, Sellathurai M, editors. Proceedings of Sixth International Conference on Computer and Communication Technologies; 2024 Oct 4–5. Tirupati, India. Singapore: Springer; 2025. p. 327–34. doi:10.1007/978-981-96-5238-9_29. [Google Scholar] [CrossRef]
58. Gautam N, Basu A, Sarkar R. Lung cancer detection from thoracic CT scans using an ensemble of deep learning models. Neural Comput Appl. 2024;36(5):2459–77. doi:10.1007/s00521-023-09130-7. [Google Scholar] [CrossRef]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.