GreenShield: A Lightweight and Robust Vision Transformer Framework in Retinal Disease Classification

Munthir Qasaimeh; Mostafa Ali; Qasem Al-Haija

doi:10.32604/cmes.2026.080864

Open Access

ARTICLE


Munthir Qasaimeh¹, Mostafa Ali¹, Qasem Abu Al-Haija²,*

1 Department of Computer Information Systems, Jordan University of Science and Technology, Irbid, Jordan
2 Department of Cybersecurity, Jordan University of Science and Technology, Irbid, Jordan

* Corresponding Author: Qasem Abu Al-Haija.

(This article belongs to the Special Issue: Advanced Computational Intelligence Techniques, Uncertain Knowledge Processing and Multi-Attribute Group Decision-Making Methods Applied in Modeling of Medical Diagnosis and Prognosis)

Computer Modeling in Engineering & Sciences 2026, 147(2), 42 https://doi.org/10.32604/cmes.2026.080864

Received 17 February 2026; Accepted 07 April 2026; Issue published 27 May 2026

Abstract

Vision Transformers (ViTs) have recently achieved high performance in retinal Optical Coherence Tomography (OCT) classification studies. However, ViT models continue to face significant challenges, including high computational cost, vulnerability to adversarial attacks, and pronounced sensitivity to preprocessing techniques. This study introduces GreenShield, a unified framework designed to produce an efficient and robust ViT model, referred to as GreenShield-ViT, which outperforms existing lightweight ViT variants in adversarial robustness for retinal OCT classification. The framework integrates a gradient-based block-importance pruning strategy that compresses the ViT-B/16 architecture with adversarial training that uses proper ImageNet normalization and anti-saturation techniques. Robustness was evaluated using FGSM, PGD, PGD-R3, Transfer-PGD, BIM, and the proposed hybrid attack (FGSM-PGD). The proposed approach achieves an approximately 50% reduction in Floating-Point Operations (FLOPs), inference time, and carbon emissions while preserving diagnostic accuracy. Experiments conducted on an NVIDIA Tesla P100 GPU using the OCT-C8, OCTID, and UCSD-3 datasets achieved clean accuracies of 92.5%, 94.78%, and 89.20%, respectively, alongside a significant reduction in attack success rates and improved model calibration. GreenShield-ViT outperformed lightweight ViT variants (Mobile-ViT, ViT-Tiny, ViT-Small) in robustness while offering competitive efficiency. These results suggest its applicability to similar ViT-based medical tasks.

Keywords

Retinal OCT classification; green AI; adversarial training; vision transformers

1 Introduction

Retinal diseases are a major cause of blindness, and many patients lose vision early because of their retinal conditions. Optical Coherence Tomography (OCT) was introduced in 1991 [1]. It captures the retinal layers and deeper structures with high-resolution, noninvasive imaging. Despite the capabilities of OCT imaging, ophthalmologists still face diagnostic challenges. For example, when disease patterns overlap, manual interpretation is time-consuming and susceptible to human error. Artificial Intelligence (AI) has begun to address these challenges. Deep learning models such as Convolutional Neural Networks (CNNs) enable rapid and accurate analysis of large OCT volumes and support ophthalmologists in enhancing diagnostic confidence.

In 2020, Vision Transformer (ViT) models were introduced as an alternative to CNNs [2] for image-processing tasks. Recent retinal OCT studies have demonstrated that ViTs outperform CNNs in accuracy, while CNNs remain more efficient than ViTs [3]. However, both model families are vulnerable to adversarial attacks. AI models in retinal OCT applications face several challenges, including security concerns, sustainability, and reliable preprocessing pipelines [4,5]. Security challenges include adversarial attacks, which apply slight changes to an image to fool AI models [6]. In the medical field, this may cause misdiagnosis with serious consequences. In addition, hospitals and clinics often lack access to high-performance servers that can run computationally intensive models. Moreover, inference time is a critical factor, especially in settings that must handle a large patient volume daily. Reducing model complexity decreases energy consumption (and carbon emissions) and makes deployment more eco-friendly, contributing to the advancement of green AI. Several lightweight ViT architectures have been introduced in retinal OCT studies to address efficiency challenges, but without considering security enhancements [7]. Moreover, prior OCT studies that focused on security ignored preprocessing challenges such as softmax saturation and the normalization space [8–10].

ViT models require ImageNet-consistent normalization owing to their ImageNet pretraining [11]. The mismatch between the space in which adversarial attacks operate (the [0, 1] pixel space) and the ImageNet normalization space changes the attack’s strength and harms adversarial training. Additionally, adversarial work must handle softmax saturation [12]. The ViT model splits images into patches. When the model focuses on one patch and ignores the others, that patch becomes dominant (it receives the highest attention score), and the softmax saturates. After softmax saturation, the model becomes overconfident, and slight changes in the dominant patch lead to misclassification with high confidence [12–14]. These challenges motivated us to design a robust and lightweight ViT model with a reliable preprocessing pipeline.

Many OCT classification studies have been limited to binary classification or a small number of classes. In adversarial work, the majority of studies have focused on CNNs and ignored ViTs. Studies that enhanced ViT efficiency did not introduce a compression technique. Several studies on improving OCT security have ignored sustainability, while studies that focused on improving efficiency have ignored security challenges. Few retinal OCT studies have addressed the impact of preprocessing challenges such as softmax saturation and normalization in adversarial training. This study aims to fill these research gaps by making the following contributions:

• Proposes an empirically validated framework that integrates gradient-based block-importance pruning with adversarial training for retinal OCT classification, jointly enhancing sustainability and robustness against attacks.

• Builds a lightweight model that outperforms existing lightweight ViT variants (Mobile-ViT, ViT-Small, and ViT-Tiny) in robustness against White-box L-infinity adversarial attacks (FGSM, PGD, BIM, and a Hybrid attack).

• Enhances the perturbation generation in adversarial training by addressing the normalization issue in the prior OCT work.

• Improves model calibration (to avoid softmax saturation).

• Conducts a comparative adversarial robustness analysis of existing lightweight ViT variants (Mobile-ViT, ViT-Small, and ViT-Tiny).

This paper is structured as follows: Section 2 provides a background that covers related concepts and findings from related work. Section 3 describes the proposed methodology. Section 4 presents experiments and results. Section 5 presents the discussion. Section 6 provides the limitations and future directions. Finally, Section 7 presents our conclusions.

2 Related Work

This section reviews the relevant literature, emphasizing its findings, contributions, and research gaps. Related work is organized into three subsections: CNN-based, transformer-based, and hybrid CNN-transformer-based approaches.

2.1 CNN-Based Approaches

This section covers studies that have used CNNs to classify retinal images with different objectives. Numerous studies in the existing literature have concentrated on enhancing robustness. For example, the authors of [15] built a robust binary classification model to detect diabetic retinopathy (DR). The Indian IDRiD and OCT datasets were used. The study applied hold-out validation (15,400 training, 3720 validation, and 400 testing images). A ResNet-50 model with pre-trained ImageNet weights was fine-tuned on these datasets. Adversarial training and distillation were applied to defend against the FGSM and L-BFGS attacks. The distillation technique outperformed adversarial training, reducing the success rates of the FGSM and L-BFGS-B attacks by 60% and 82%, respectively.

In contrast, the authors of [16] proposed a lightweight, robust CNN model to address the sustainability and security of retinal OCT classification. The study used the Kermany OCT, COVID, and Kidney Stone CT datasets. An under-sampling technique was used to balance the data. Hold-out validation was applied (80% for training and 20% for testing), with 10-fold cross-validation on the training set. The study designed a custom lightweight CNN model, and no defense mechanisms were applied. The model was evaluated using four adversarial FGSM examples and achieved 94% clean accuracy on the Kermany dataset.

Furthermore, some retinal OCT studies have used CNNs with a focus on accuracy. For example, the study in [17] aimed to develop a system for detecting retinal diseases using OCT images. It used two models for feature extraction (ResNet-50 and AlexNet) and the Kermany dataset. The two CNNs were used in parallel to extract features, and two feature selection techniques were applied in parallel (PCA and an Entropy-based Ant Colony System). The selected features were combined into an optimized vector and then fed into machine learning models such as K-Nearest Neighbors (kNN), Linear Support Vector Machine (LSVM), Linear Discriminant (LD), and decision trees. The LD classifier achieved high accuracy (99.9%).

Similarly, the authors in [18] addressed the challenge of multi-class accuracy (retinal disease classification) under a limited and imbalanced dataset, proposing a modified VGG16-based CNN with data augmentation, class weighting, and explainability techniques (Grad-CAM) using a small Mendeley fundus dataset of 302 images. The model achieved 93.42% training accuracy and 77.5% validation accuracy, with improved interpretability and class balance.

All these studies were limited to CNNs and did not address the improvement in the robustness and efficiency of ViT models.

2.2 Transformer-Based Approaches

This section covers studies in the existing literature that use Vision Transformers as the main model. Most studies that employed Transformers in retinal OCT classification focused on efficiency. For example, the authors of [10] designed a lightweight approach for diabetic retinopathy classification. The images were resized to 896 × 896 pixels and split into 16 patches (224 × 224 pixels each). A ViT-Small model served as the feature extractor. The classification (CLS) tokens were combined into a 16 × 384 matrix and fed into the Global Instance Computing Block (GICB) layer to compute attention scores, and then into a Multi-Layer Perceptron (MLP) layer with a residual connection. The approach was applied to the APTOS 2019 and Messidor-1 datasets. The inference time was reduced by 62%, and the accuracy improved by 2.1% on APTOS and 12.1% on Messidor.

Extending the focus on efficiency, the authors of [19] evaluated several ViT models (ViT, T2T-ViT, and Mobile-ViT) for retinal OCT classification and proposed Mobile-ViT as the main approach. The public OCT and Mendeley datasets were used. Hold-out validation was applied, with 10% of the data reserved for validation. Mobile-ViT outperformed the other models, achieving 99.1% accuracy with better efficiency.

Some studies have employed ViTs without considering sustainability. The authors of [9] compared two Vision Transformer variants (ViT-16 and ViT-32) in the binary classification of retinal OCT images. The ViT models were trained to detect Diabetic Macular Edema (DME) using three optimizers: Adam, Stochastic Gradient Descent (SGD), and Root Mean Square Propagation (RMSProp). Two balanced datasets were used: the KMC and Mendeley datasets. A custom classification head was added, and Grad-CAM was used for visualization. ViT-16 outperformed all the baselines across both datasets. The Adam optimizer achieved the best results, with 100% recall and stable training.

However, only a limited number of OCT studies have focused on ViT robustness against adversarial attacks, such as [8]. That study proposed adversarial training as a defense mechanism and explored performance on a set of binary-labeled medical datasets: Fundoscopy, Chest X-ray, ISIC2019, and Breast Histology. Images were normalized using ImageNet normalization during the preprocessing phase. The proposed model was a pretrained ViT-Tiny model. The study employed hold-out validation (85% for training and 15% for testing). All models were evaluated against multiple attacks (FGSM, BIM, PGD, and CW), with ε varied between 0.25/255 and 5/255 for FGSM, 20 steps for PGD, and 40 steps for BIM. The ViT model outperformed all CNNs on balanced datasets in terms of both clean accuracy and adversarial robustness.

The authors of [10,19] proposed lightweight models without evaluating their robustness to attacks. In [9], ViT variants were compared based on performance, without considering robustness against attacks. The study in [8] focused on robustness enhancement but utilized an imbalanced dataset and applied adversarial training without considering proper ImageNet normalization.

2.3 Hybrid CNN-Transformer-Based Approaches

Many studies have employed a hybrid framework combining a CNN with a Transformer to classify retinal OCT images in different research directions. For example, the authors of [20] aimed to enhance efficiency by proposing a hybrid CNN-Vision Transformer architecture evaluated on OCT2017, OCT-C8, and an external validation dataset. The model architecture comprises several blocks. The Depthwise Convolution (DW) block uses depthwise convolution for local lesion scanning. The MobileViT (MV) block is based on Mobile-ViT and learns global patterns. The Convolutional Block Attention Module (CBAM) block focuses on the relevant regions using average pooling, and a final layer performs classification. The proposed approach achieved high performance, but without enhancing the model's resistance to attacks.

Similarly, the authors of [21] proposed an MGR-GAN to classify retinal OCT images. This hybrid model architecture combines a Transformer with a CNN. The UCSD OCT dataset was used in this study. The study designed a generator (Transformer-based) to generate images, and a discriminator (ResNet) to distinguish between generated and original images. The discriminator is also used for feature extraction and classification with three convolutional layers (F1, F2, and F3) and a softmax layer. The model achieved a 99% accuracy rate. This study did not focus on the security and sustainability of the model.

Similarly, the authors of [22] addressed several problems in the related work, including the need for large, balanced, diverse datasets and the health risks of Fluorescein Angiography (FA) imaging, which requires dye injection. The study used a large ultra-widefield retinal imaging dataset with 1198 patients, along with Messidor-2 for external evaluation, and proposed a hybrid approach combining a pix2pixHD-based GAN for multi-phase FA image synthesis and a Swin Transformer for diabetic retinopathy classification. The results demonstrated that the generator achieved high realism (SSIM = 0.82–0.85) with an AUC of 0.910, and an accuracy of 82.9%.

Similarly, the authors of [23] proposed RAD-IoMT, a defense mechanism built around a transformer-based detector that detects adversarial attacks and passes only clean samples to the CNN classifier. This study used three datasets (retinal OCT, skin cancer, and chest X-ray). A VGG-16 model (the standard 138 million parameters) was trained as the classifier, and the transformer was trained as a binary classifier to detect attacks. The study performed four attacks (white box: PGD and FGSM; black box: AGN and AUN). The detector achieved an accuracy between 92% and 95%, whereas the classifier achieved an accuracy between 86% and 97%. However, deploying two complex models (detector and classifier) incurs extra cost (heavy deployment). Additionally, if the attacker fools the detector, the classifier is exposed to the attack (a single point of failure). Table 1 summarizes the results of the previous studies.

Table 1: Summary of the results of previous studies.

3 Materials and Methodology

This section presents the proposed GreenShield framework, detailing all phases involved in building and evaluating the GreenShield-ViT model. The process begins with the data collection phase, in which the target dataset is selected. This is followed by a model selection phase, in which an appropriate architecture is identified. A clean training phase then trains the model on clean (unperturbed) samples. Next, a model-pruning phase compresses the model architecture and fine-tunes the resulting reduced model. Subsequently, an adversarial training phase enhances the robustness of the pruned model against adversarial perturbations. Finally, the evaluation setup phase specifies the evaluation criteria, establishes baselines, and defines the metrics used to assess the performance of the proposed approach through comprehensive experiments. The proposed methodology is illustrated in Fig. 1.


Figure 1: Proposed methodology phases.

3.1 Data Collection Phase

This study adopts the Retinal OCT classification C-8 dataset [24] because it was designed for multi-class classification and includes eight classes. The dataset is publicly available on Kaggle, labeled, and categorized with a balanced class distribution, and it includes 24,000 images. It contains seven retinal diseases, namely Age-related Macular Degeneration (AMD), Choroidal Neovascularization (CNV), Central Serous Retinopathy (CSR), Diabetic Macular Edema (DME), Diabetic Retinopathy (DR), Drusen, and Macular Hole (MH), plus a Normal Retina class. This makes the dataset well suited to building predictive models. Fig. 2 shows an example of each class. In addition to the C-8 dataset, this study incorporates the OCTID dataset, which is available on Kaggle and consists of five classes: Age-related Macular Degeneration (AMD), Central Serous Retinopathy (CSR), Diabetic Retinopathy (DR), Macular Hole (MH), and Normal retina. The dataset contains around 588 OCT images. OCTID is widely used to evaluate model generalization under limited data scenarios. In this study, it is used to address domain shift through a small fine-tuning subphase [25].


Figure 2: Retinal OCT C-8 dataset classes [24].

Furthermore, the UCSD-3 dataset (derived from the UCSD OCT dataset) is included for additional evaluation. This dataset is also available on Kaggle and consists of three classes: Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME), and Normal retina. In this study, it is used as an external evaluation dataset. The test set includes 750 images, which are used for performance assessment [26].

3.2 Model Selection Phase

This study trains a Vision Transformer model to classify retinal OCT diseases. The selected model backbone is ViT-B/16, which serves as the standard, original baseline for Vision Transformer models and was pre-trained on ImageNet-1k. This architecture has demonstrated robust performance in retinal OCT studies; however, further improvements in efficiency and robustness are required. The architecture is based on the transformer model [2]. It splits images into patches, each of which is treated as a token. Patches are processed through a linear layer to form vectors, referred to as patch embeddings. Positional embeddings are added to preserve the spatial ordering. In addition, a special classification token (CLS) is added to summarize all patches and capture the relationships and information needed for the prediction (holding the final decision). The vectors are then processed by a stack of blocks. Each block contains a multi-head self-attention (MHSA) layer and an MLP layer. Through the self-attention mechanism, the image patches attend to each other to model long-range dependencies. MHSA computes the attention scores using a dot product. ViT splits the embeddings into heads, and for each head, the core attention formula (1) is applied [2].

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

Softmax is applied to assign attention weights to each patch. After MHSA, the embeddings are processed through an MLP layer (with an activation function) to capture the structures and features inside each image. Finally, the CLS token is processed through a linear layer and softmax to generate probabilities for each class. The backbone architecture consists of 12 blocks, 12 attention heads with a head dimension of 64, an MLP ratio of 4.0 with GELU activation, 196 patches plus the CLS token (197 tokens in total), 86 million parameters, and a classification head. It is the base ViT version, with a 16 × 16 patch size, a 224 × 224 image size, and an embedding dimension of 768. The attention mechanism enables the ViT model to capture the relationships between patches and comprehend the differences between various regions within the images. Fig. 3 shows the architecture of the vision transformer [2].


Figure 3: Vision transformer architecture [2].
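The attention formula (1) and the token bookkeeping of the backbone can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the helper names (`softmax`, `attention`) and the toy Q, K, V values are assumptions made for illustration, and a real ViT would use batched tensor operations with learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Token bookkeeping for ViT-B/16 at a 224 x 224 input (values from the text).
image_size, patch_size, embed_dim, num_heads = 224, 16, 768, 12
num_patches = (image_size // patch_size) ** 2   # 14 * 14 patches
num_tokens = num_patches + 1                    # patches plus the CLS token
head_dim = embed_dim // num_heads               # per-head dimension

# Toy check with three 2-dimensional tokens (illustrative values only).
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
```

Because each row of attention weights sums to one, every output row is a convex combination of the value vectors, which is why a single dominant weight (softmax saturation) makes the output depend almost entirely on one patch.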

3.3 Clean Training Phase

In this phase, the selected ViT model is trained. The training pipeline starts with a hold-out split: 70% training, 15% validation, and 15% testing. Clean training is applied in two rounds. First, the model is trained to build a reliable classifier before pruning. Second, the pruned model is fine-tuned to help it adapt to the new architecture. Both rounds use the same setup and differ only in the number of epochs (10 epochs in the first round, 2 epochs in the second). An NVIDIA Tesla P100 GPU (provided by Kaggle) was used to train and evaluate the model, with a batch size of 32 and four CPU workers to prepare batches during training.
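A stratified 70/15/15 hold-out split of the kind described above could be sketched as follows. This is an illustrative assumption rather than the authors' code: the function name `stratified_split` is hypothetical, the label list is synthetic (8 balanced classes of 3000 images, which is what 24,000 balanced images imply), and seed 42 is one of the seeds listed in the evaluation setup.

```python
import random
from collections import defaultdict

def stratified_split(labels, seed, train=0.70, val=0.15):
    """Return (train_idx, val_idx, test_idx) preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx, test_idx = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)                      # shuffle within the class only
        n_train = round(train * len(idxs))
        n_val = round(val * len(idxs))
        train_idx += idxs[:n_train]
        val_idx += idxs[n_train:n_train + n_val]
        test_idx += idxs[n_train + n_val:]     # remainder goes to the test set
    return train_idx, val_idx, test_idx

# Synthetic balanced label list: 8 classes x 3000 images = 24,000 images.
labels = [c for c in range(8) for _ in range(3000)]
tr, va, te = stratified_split(labels, seed=42)
```

Splitting class by class keeps the balanced distribution in every subset, so the validation and test accuracies are not skewed toward any single disease class.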

The training set was used for building the model, and the validation set for model selection, hyperparameter tuning, and pruning. The test set was used in the evaluation phase. The pipeline starts by resizing and then cropping the images to 224 × 224 pixels. Augmentation is then applied: random zoom, rotation (±20°), brightness (0.3), contrast adjustment (0.3), flipping (p = 0.5), and shift (±10). Subsequently, ImageNet normalization is applied. In clean training, ImageNet normalization can be applied directly during the preprocessing phase without any issues. In the adversarial training phase, however, this study uses a custom pipeline. The training setup includes cross-entropy as the loss function (measuring the difference between actual and predicted values) and the AdamW optimizer, which updates the model parameters according to the gradients to reduce the loss. It also uses weight decay (L2-regularization that prevents parameters from taking large values), a warmup scheduler followed by a cosine learning-rate schedule (to control the LR values during training), and Automatic Mixed Precision (AMP) (to reduce training cost). Several anti-softmax-saturation techniques are applied: temperature scaling (1.5), a post-attention layer norm after the attention residual, label smoothing (0.1), and gradient clipping (1.0).
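To make the effect of two of the anti-saturation settings concrete, the sketch below applies temperature scaling (1.5) and label smoothing (0.1) to a hand-picked logit vector. It assumes the temperature divides the logits before the softmax and that smoothing spreads its mass uniformly over all classes; the text does not specify either detail, and the logits are illustrative values only.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature that divides the logits."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_targets(true_class, num_classes, smoothing=0.1):
    """Label smoothing: (1 - s) on the true class plus s / K spread over all K classes."""
    base = smoothing / num_classes
    return [base + (1.0 - smoothing) if k == true_class else base
            for k in range(num_classes)]

def cross_entropy(probs, targets):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

logits = [6.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]  # illustrative values only
p_plain = softmax(logits)                    # temperature 1.0
p_scaled = softmax(logits, temperature=1.5)  # flatter, less saturated distribution
targets = smoothed_targets(true_class=0, num_classes=8, smoothing=0.1)
loss = cross_entropy(p_scaled, targets)
```

Both settings pull the predicted distribution away from a one-hot output: the temperature lowers the maximum probability, and the smoothed target penalizes predictions that place all the mass on a single class.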

After each epoch, the model switches from training mode to validation mode, which helps monitor its performance during training. This study uses a light validation setup of six batches (192 images). Early stopping halts training after three rounds without improvement. During validation and evaluation, images are resized and cropped to 224 × 224 pixels without augmentation. In addition, light hyperparameter tuning is used to select the values of two parameters: the learning rate and the weight decay. The tuning is based on five training epochs on an 8% sample of the training set, with a small evaluation during training on a light validation set of three batches. The optimal values are selected from the following search space for each parameter:

• Learning rate search space: {5 × 10⁻⁵, 1 × 10⁻⁴, 2 × 10⁻⁴, 3 × 10⁻⁴}

• Weight decay search space: {0.005, 0.01, 0.02, 0.05}

The values selected by hyperparameter tuning are LR = 5 × 10⁻⁵ and weight decay = 0.01. Fig. 4 summarizes the proposed clean training pipeline.


Figure 4: The proposed clean training pipeline, including all subphases.
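The light hyperparameter tuning step can be pictured as a small search over the two value sets. The sketch below assumes a joint 4 × 4 grid, which the text does not state explicitly, and replaces the short training run with a hypothetical stand-in, `toy_validation_loss`, so that the example stays self-contained.

```python
from itertools import product

learning_rates = [5e-5, 1e-4, 2e-4, 3e-4]
weight_decays = [0.005, 0.01, 0.02, 0.05]

def grid_search(evaluate):
    """Return (best_lr, best_wd, best_score) over the grid; a lower score is better."""
    best = None
    for lr, wd in product(learning_rates, weight_decays):
        score = evaluate(lr, wd)
        if best is None or score < best[2]:
            best = (lr, wd, score)
    return best

def toy_validation_loss(lr, wd):
    """Hypothetical stand-in for a short training run; not the paper's objective."""
    return abs(lr - 5e-5) * 1e4 + abs(wd - 0.01)

best_lr, best_wd, best_score = grid_search(toy_validation_loss)
```

In practice `evaluate` would train for the five short epochs on the 8% subsample and return the light-validation loss; the toy function is constructed only so that the search returns the values reported in the text.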

3.4 Model Pruning Phase

This phase prunes the proposed trained model after clean training. This study proposes a pruning mechanism based on a validation set. An importance score was assigned to each block of the ViT model.

During backpropagation, a gradient is computed for each parameter of the ViT. Each gradient carries two pieces of information: magnitude and direction. The magnitude represents the impact of a change in the parameter on the loss, and the direction indicates how the parameter should be adjusted to minimize the loss. The proposed mechanism ignores the direction and considers only the magnitude, which determines the importance of the parameter. A neuron whose parameters (weights) have high-magnitude gradients is an important neuron, and a block that holds many important neurons is an important block. The absolute gradient value represents the magnitude without the direction (i.e., without the positive or negative sign). The average absolute value of all gradients belonging to a single block represents the importance of that block (how strongly the block affects the loss).

The proposed mechanism computes the average of the absolute gradients within each block, using 30 batches of images from the validation set. For each batch, the forward pass computes the loss, and the backward pass (backpropagation) computes the gradients for each block. The average of the absolute gradients is the sum of the absolute gradient values divided by the number of gradients. A ranking step then orders the blocks by importance to identify high- and low-impact blocks; a low average indicates a minimal impact. The pruned architecture is fine-tuned using clean training for 2 epochs, which allows the model to adapt to the new architecture and achieve high performance.

The ViT backbone includes 12 blocks. Rather than evaluating every depth, the mechanism defines a search space of candidate architectures centered on half the depth: 12/2 = 6, then 6 + 2 and 6 − 2, giving (4, 6, 8) retained blocks. In the experimental phase, each option in the search space is evaluated: removing eight blocks and keeping four, pruning six, or pruning four. The study selects the most efficient architecture whose accuracy drops by less than 2%, to find a sustainable model without degrading performance. The search space represents approximately 30%, 50%, and 70% reductions in model depth, which provides well-separated and meaningful compression regions: 8 blocks approximate the upper range (layers 7–11), 6 blocks capture the central 50% compression point, and 4 blocks represent the lower range (layers 1–5). This search space provides a representative summary of the performance-efficiency trade-off across the full depth (1–12 blocks), enabling effective analysis without exhaustively evaluating all configurations.

The next phase involves the proposed adversarial training pipeline. Fig. 5 shows the pruning approach.


Figure 5: Proposed pruning approach.
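The block-importance scoring and the selection rule described in this phase can be sketched in a few lines. Everything concrete here is an assumption for illustration: the function names are hypothetical, the gradients are synthetic, and the accuracies passed to `choose_architecture` are invented placeholders, not results from the paper. In a real pipeline the gradients would be accumulated over the 30 validation batches by automatic differentiation.

```python
def block_importance(block_grads):
    """Mean absolute gradient per block; block_grads maps block index -> gradient list."""
    return {b: sum(abs(g) for g in grads) / len(grads)
            for b, grads in block_grads.items()}

def select_blocks(importance, keep):
    """Keep the `keep` highest-scoring blocks, returned in their original order."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return sorted(ranked[:keep])

def choose_architecture(candidates, baseline_acc, max_drop=2.0):
    """Pick the smallest candidate whose accuracy drop stays below max_drop points.

    candidates maps the number of kept blocks -> accuracy (%) after fine-tuning.
    """
    ok = [k for k, acc in candidates.items() if baseline_acc - acc < max_drop]
    return min(ok) if ok else None

# Synthetic gradients for a 12-block model (two values per block for brevity).
toy_grads = {b: [0.01 * (b + 1), -0.02 * (b + 1)] for b in range(12)}
scores = block_importance(toy_grads)
kept = {k: select_blocks(scores, k) for k in (4, 6, 8)}

# Placeholder accuracies, for illustration only (not reported results).
chosen = choose_architecture({4: 88.0, 6: 93.5, 8: 94.2}, baseline_acc=94.5)
```

Taking the absolute value before averaging matters: positive and negative gradients within a block would otherwise cancel and make an influential block look unimportant.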

3.5 Adversarial Training Phase

The adversarial training phase is an extended training phase, conducted after clean training, that makes the model robust against adversarial attacks. Adversarial training shows the model slightly modified images during training, which forces it to learn features that remain stable even when the input changes. The model learns to focus on global patterns instead of intricate details, making its layers and tokens more robust. This phase applies adversarial training to the pruned model with the same anti-saturation techniques used in the clean training pipeline (temperature scaling 1.5, post-attention layer norm, label smoothing 0.1, and gradient clipping 1.0). The hold-out split includes a 70% training set, a 15% validation set, and a 15% testing set. The validation set consisted of a stratified sample of 125 images from each class. Validation uses only the forward pass (without gradients). The model switches from training mode to validation mode at the end of each training epoch. Validation on 2800 images is used to maintain clean generalization. Validating clean performance is important to ensure that adversarial training does not harm performance on clean data. Adversarial training is more complex than clean training and requires more resources. Therefore, this study applies AMP, while hyperparameter tuning and augmentation are omitted in this phase to avoid additional cost. The same GPU (P100) was used in the training and evaluation phases.

The images were resized to 224 × 224 pixels. ImageNet normalization is then applied. ImageNet normalization is important for pretrained models (such as ViTs) to ensure consistency in the input distribution. Adversarial attacks, however, apply perturbations in pixel space. The mismatch between the two spaces changes the effective value of the chosen epsilon parameter, which harms adversarial training. ImageNet normalization is given by formula (2) below [27]:

$$x'_c = \frac{x_c - \mu_c}{\sigma_c} \tag{2}$$

Here, $c \in \{R, G, B\}$ denotes the RGB channel, $x_c \in [0, 1]$ is the pixel value, and $\mu_c$ and $\sigma_c$ denote the channel-wise mean and standard deviation computed from the ImageNet dataset. Adversarial perturbations are defined in the pixel space as $\tilde{x}_c = x_c + \varepsilon$, where $\varepsilon$ is the perturbation magnitude. The transformation to ImageNet space is computed as follows.

[Equation image not recovered]

Thus, the effective perturbation in pixel space becomes $\Delta x_c = \varepsilon \cdot \sigma_c$. For $\varepsilon = 0.3$ and $\sigma_c \approx 0.229$ (the ImageNet standard deviation of the red channel), the effective perturbation is approximately 0.069, which is much smaller than $\varepsilon = 0.3$. This demonstrates that applying perturbations in normalized space leads to weaker adversarial perturbations during training and evaluation. This study therefore wraps ImageNet normalization into the model, so that the attack’s perturbations are added directly to the pixels (within the [0, 1] pixel space) and normalization is performed afterwards.
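The arithmetic behind this mismatch can be checked with a short sketch. The ImageNet mean and standard deviation are the standard published constants; the pixel value `x = 0.6` is a hypothetical choice (picked so that the correctly perturbed pixel equals 0.9), and the helper names are illustrative rather than the authors' code.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # standard ImageNet channel means (R, G, B)
IMAGENET_STD = (0.229, 0.224, 0.225)   # standard ImageNet channel standard deviations

def normalize(pixel, channel):
    """Map a [0, 1] pixel value to ImageNet-normalized space, as in formula (2)."""
    return (pixel - IMAGENET_MEAN[channel]) / IMAGENET_STD[channel]

def denormalize(value, channel):
    """Inverse of normalize: map a normalized value back to pixel space."""
    return value * IMAGENET_STD[channel] + IMAGENET_MEAN[channel]

def clip01(v):
    """Keep a value inside the valid pixel range [0, 1]."""
    return min(1.0, max(0.0, v))

eps = 0.3   # perturbation budget used in the text's example
x = 0.6     # hypothetical red-channel pixel value in [0, 1]

# Case 1: perturbation added AFTER normalization (the mismatch described above).
x_bad = denormalize(normalize(x, 0) + eps, 0)
effective_eps = x_bad - x            # equals eps * sigma_R, about 0.069

# Case 2: normalization wrapped into the model, perturbation added in pixel space.
x_adv = clip01(x + eps)              # stays a valid pixel value
model_input = normalize(x_adv, 0)    # what the wrapped model normalizes internally
```

With the wrapper, the attack and the defense both measure ε in the same [0, 1] pixel units, so a budget such as 8/255 means the same thing during training and evaluation.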

[Equation image not recovered]

The value 0.9 corresponds to the correctly perturbed pixel value in the original space. Wrapping normalization in a model is the correct normalization mechanism for adversarial training. Adversarial training generates attack samples by identifying spots to fool the model based on the gradient direction, thereby increasing loss. The model was then trained on the generated adversarial samples to become robust against these spots, thereby reducing the loss. The gradient direction guides adversarial sample generation and guides model training on these samples. Eq. (3) represents the adversarial training equation [28].

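$$\min_{\theta}\ \mathbb{E}_{(x,y)}\left[\max_{\delta}\ L\big(f_\theta(x+\delta),\, y\big)\right] \quad \text{s.t.}\ \|\delta\|_\infty \le \varepsilon \tag{3}$$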

The proposed model was trained using a combination of clean and adversarial examples. Clean samples were combined with PGD and BIM samples at a 1:1:1 ratio (clean:PGD:BIM), with a batch size of 32 for each sample type (and four CPU workers), so each combined batch consisted of 96 images. The gradients of the loss were computed by automatic differentiation in PyTorch and used to construct the PGD and BIM perturbations under the L-infinity constraint. These perturbations were dynamically updated at each iteration using the current model parameters. The training hyperparameters followed the default setup and the most common practice (ε = 8/255, α = 2/255, 7 PGD steps, and 10 BIM steps). Training ran for six epochs with early stopping triggered by the validation loss (training stops after two epochs without improvement), a cross-entropy loss function (with label smoothing 0.1), and the AdamW optimizer (learning rate 1e−4 and weight decay 0.01, assigned manually without tuning). In this phase, the PGD- and BIM-generated adversarial samples retained the original ground-truth labels of their corresponding clean images, and the model was trained jointly on clean and perturbed samples using these labels. Formula (4) represents the PGD attack, and Formula (5) represents the BIM attack. Fig. 6 shows the adversarial training pipeline [29,30]. After adversarial training, the model was fine-tuned for two epochs using clean training on a mixed dataset composed of OCT-C8 and OCTID samples to adapt to domain shift. During this stage, most model parameters were frozen, and updates were applied only to the classification head, the last transformer block, and the final normalization layer. The fine-tuning was performed on 5000 images and included only the classes shared between the two datasets.

$$x_{\mathrm{adv}} = \mathrm{Proj}_{\varepsilon}\big(x + \alpha\cdot\operatorname{sign}(\nabla_{x} J(\theta, x, y))\big) \qquad (4)$$

$$x_{t+1} = \mathrm{Clip}_{x,\,\varepsilon^{\infty}}\big(x_{t} + \alpha\cdot\operatorname{sign}(\nabla_{x} J(\theta, x_{t}, y))\big) \qquad (5)$$
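The update rules in Formulas (4) and (5) can be sketched in plain Python on a toy logistic model whose input gradient is available in closed form. The values ε = 8/255, α = 2/255, 7 PGD steps, and 10 BIM steps come from the text; the toy weights, the three-pixel input, the function names, and the optional random start for PGD (following the usual formulation in [29]) are assumptions made for illustration, not the authors' PyTorch code.

```python
import math
import random

EPS, ALPHA = 8 / 255, 2 / 255  # budget and step size quoted in the text


def loss_and_grad(weights, bias, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, [(p - y) * w for w in weights]


def sign(value):
    return (value > 0) - (value < 0)


def project(x_adv, x_clean, eps):
    """Clip to the L-infinity ball around x_clean, then to the pixel range [0, 1]."""
    projected = []
    for a, c in zip(x_adv, x_clean):
        a = min(max(a, c - eps), c + eps)
        projected.append(min(max(a, 0.0), 1.0))
    return projected


def iterative_attack(weights, bias, x, y, steps, eps=EPS, alpha=ALPHA,
                     random_start=False, seed=0):
    """Signed-gradient ascent with projection; random_start=True gives the PGD variant."""
    rng = random.Random(seed)
    x_adv = list(x)
    if random_start:  # assumed PGD-style start somewhere inside the eps-ball
        x_adv = project([xi + rng.uniform(-eps, eps) for xi in x], x, eps)
    for _ in range(steps):
        _, grad = loss_and_grad(weights, bias, x_adv, y)
        stepped = [xi + alpha * sign(g) for xi, g in zip(x_adv, grad)]
        x_adv = project(stepped, x, eps)
    return x_adv


weights, bias = [1.5, -2.0, 0.5], 0.1   # hypothetical toy model
x_clean, label = [0.6, 0.2, 0.9], 1     # hypothetical three-pixel "image"
bim = iterative_attack(weights, bias, x_clean, label, steps=10)
pgd = iterative_attack(weights, bias, x_clean, label, steps=7, random_start=True)
clean_loss, _ = loss_and_grad(weights, bias, x_clean, label)
bim_loss, _ = loss_and_grad(weights, bias, bim, label)
pgd_loss, _ = loss_and_grad(weights, bias, pgd, label)
print(f"clean={clean_loss:.4f} BIM={bim_loss:.4f} PGD={pgd_loss:.4f}")
```

Both attacks keep the original label `y` while increasing the loss, which mirrors the statement that adversarial samples retain the ground-truth labels of their clean counterparts during training.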

Figure 6: Adversarial training phase, including all subphases.

3.6 Evaluation Setup

The evaluation plan comprised several experiments on the proposed model (GreenShield-ViT). First, the pruning search space was explored to select the optimal model. This comparison across pruned configurations was also used to investigate the sustainability enhancements. t-Distributed Stochastic Neighbor Embedding (t-SNE) plots and attention-rollout visualizations were examined to verify that the clusters remain clearly separated and that attention stays on the same relevant regions. Following adversarial training, the performance and calibration of GreenShield-ViT were assessed against adversarial samples. Five repeated stratified hold-out experiments were conducted with different random seeds (42, 123, 2024, 3407, and 9999) to ensure statistical robustness. In each trial, the dataset was randomly split into 70% training, 15% validation, and 15% testing while preserving the class distribution. GreenShield-ViT was benchmarked against Mobile-ViT Extra Extra Small (XXS), ViT-Small (ViT-Small/16), and ViT-Tiny (ViT-Tiny/16). These baselines were selected from the related work [8,10,19]. The proposed clean and adversarial training pipelines were applied to all models before their robustness against attacks was investigated. The robustness evaluation is based on a small set of white-box attacks: FGSM, PGD, BIM, and the proposed hybrid attack [6]. The proposed hybrid attack combines FGSM and PGD. It begins with an FGSM step (based on the epsilon range and the gradient direction) that makes a large jump to the bound, i.e., the maximum perturbation allowed by the epsilon range. Subsequently, if the gradient direction changes, PGD steps are applied. If the direction remains the same, the steps are projected to prevent pixels from moving farther than ±ε. The attack is thus designed to start from the boundary rather than from the original pixel value.
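One possible reading of the hybrid FGSM-PGD attack described above is sketched below in plain Python: a single FGSM step of size ε jumps to the boundary of the ε-ball, and projected signed-gradient steps of size α follow. This is an interpretation of the textual description, not the authors' code; the toy logistic model, the three-pixel input, the number of refinement steps, and all function names are hypothetical, and the actual hyperparameters are those listed in Table 2.

```python
import math

EPS, ALPHA, STEPS = 8 / 255, 2 / 255, 7  # STEPS is a hypothetical choice


def input_gradient(weights, bias, x, y):
    """Gradient of the binary cross-entropy of a logistic model w.r.t. the input x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * w for w in weights]


def sign(value):
    return (value > 0) - (value < 0)


def project(x_adv, x_clean, eps):
    """Keep every pixel within +/- eps of the clean pixel and inside [0, 1]."""
    projected = []
    for a, c in zip(x_adv, x_clean):
        a = min(max(a, c - eps), c + eps)
        projected.append(min(max(a, 0.0), 1.0))
    return projected


def hybrid_fgsm_pgd(weights, bias, x, y, eps=EPS, alpha=ALPHA, steps=STEPS):
    # Stage 1: one FGSM step of size eps jumps straight to the boundary of the eps-ball.
    grad = input_gradient(weights, bias, x, y)
    x_adv = project([xi + eps * sign(g) for xi, g in zip(x, grad)], x, eps)
    # Stage 2: PGD refinement. Coordinates whose gradient sign flips move back inside
    # the ball; coordinates whose sign is unchanged are held at the boundary by projection.
    for _ in range(steps):
        grad = input_gradient(weights, bias, x_adv, y)
        x_adv = project([xi + alpha * sign(g) for xi, g in zip(x_adv, grad)], x, eps)
    return x_adv


weights, bias = [1.5, -2.0, 0.5], 0.1   # hypothetical toy model
x_clean, label = [0.6, 0.2, 0.9], 1     # hypothetical three-pixel "image"
x_hybrid = hybrid_fgsm_pgd(weights, bias, x_clean, label)
max_shift = max(abs(a - c) for a, c in zip(x_hybrid, x_clean))
print(f"largest pixel shift: {max_shift:.5f} (budget {EPS:.5f})")
```

On this toy model the gradient sign never changes, so the refinement steps leave every pixel at the boundary reached by the FGSM jump; on a deep network the sign can flip between steps, which is where the PGD stage would alter the perturbation.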
This attack is introduced as an additional evaluation scenario to assess the robustness of the proposed model under mixed iterative attack strategies. This setting reflects a potential adversarial scenario in which multiple attack strategies are combined to fool the model. It also assesses robustness against an unexpected attack strategy that was not explicitly used during training. The same default evaluation hyperparameters, chosen in line with widespread practice, were applied to all attacks. The hyperparameters of the proposed evaluation attack are listed in Table 2. Additionally, Fig. 7 shows that the adopted parameters generate a visually imperceptible perturbation. The per-pixel difference was computed, and matplotlib was used to highlight the changed pixels (those with the highest perturbation magnitude).

Figure 7: Adversarial perturbations on the proposed hyperparameters.

After demonstrating that the proposed model is more robust than the baselines, the model was fine-tuned using a combined dataset from OCT-C8 and OCTID to mitigate domain shift and improve generalization. It was subsequently evaluated on three datasets (OCT-C8, UCSD-3, and OCTID), where UCSD-3 represents an unseen dataset for assessing generalization, and OCTID provides an additional evaluation under a different domain shift [25,26]. The UCSD-3 dataset was used solely for evaluation to verify that the model generalizes to external samples beyond the main dataset. This dataset is available on Kaggle with three labels (CNV, DRUSEN, and Normal). The test set contained 750 images. All experiments were conducted using PyTorch and Torchvision. Finally, the model was evaluated at different epsilon levels (2/255, 4/255, 8/255, and 16/255) to verify that it remains robust beyond the common evaluation practice. Fig. 8 shows the attacks at the different epsilon levels. In addition, the model was evaluated against PGD-7 with three random restarts and against a Transfer-PGD attack generated on a surrogate model, namely the original ViT before pruning (12 blocks).

Figure 8: Adversarial perturbations at different epsilon levels.

A set of evaluation metrics was adopted to interpret the findings and results. This study evaluates model performance based on Accuracy, Precision, Recall, F1-score, and the average of the maximum softmax probabilities, which is used to check softmax saturation. Prior studies have demonstrated that the average maximum softmax probability (Avg MSP) is a valid metric for checking softmax saturation [12,13]. When the maximum softmax probability lies in the range 0.97–1.00, the softmax is considered saturated. In addition, the Expected Calibration Error (ECE) measures the gap between the confidence and accuracy of the model (a higher ECE indicates a larger gap) [31]. The efficiency of the model is evaluated using floating-point operations (FLOPs), computed using fvcore and ptflops (in Python), the number of model parameters, model size (in MB), inference time [32], inference memory usage [33], and inference carbon footprint (CO2). The carbon footprint refers to the greenhouse gas emissions generated by inference, measured in CO2-equivalent units. Carbon emissions were computed using the Green Algorithms estimation tool [34]. PyTorch Compute Unified Device Architecture (CUDA) utilities were used for timing and memory measurements. The robustness of the model was evaluated using the model’s accuracy against each attack [35], the attack success rate, the robustness gap between clean and adversarial accuracies, and the improvement rate before and after applying adversarial training [36]. Eq. (6) represents the carbon footprint equation [34].

$$\text{Carbon Footprint} = \text{Energy Needed} \times \text{Carbon Intensity} \qquad (6)$$
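Several of the metrics described in this subsection can be sketched in a few lines of plain Python: Avg MSP, ECE with equal-width confidence bins (the usual formulation in [31]), the robustness gap, and Eq. (6). The number of bins (10), the three-class toy predictions, and the energy and carbon-intensity figures are hypothetical values used only to make the example concrete, and the function names are illustrative.

```python
def avg_msp(probabilities):
    """Average maximum softmax probability over a set of predictions."""
    return sum(max(p) for p in probabilities) / len(probabilities)


def expected_calibration_error(probabilities, labels, num_bins=10):
    """ECE with equal-width bins: sum over bins of |B|/n * |accuracy(B) - confidence(B)|."""
    n = len(labels)
    bins = [[] for _ in range(num_bins)]
    for probs, label in zip(probabilities, labels):
        confidence = max(probs)
        correct = 1.0 if probs.index(confidence) == label else 0.0
        index = min(int(confidence * num_bins), num_bins - 1)  # confidence 1.0 -> last bin
        bins[index].append((confidence, correct))
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(k for _, k in members) / len(members)
            ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def robustness_gap(clean_accuracy, adversarial_accuracy):
    """Gap between clean and adversarial accuracy."""
    return clean_accuracy - adversarial_accuracy


def carbon_footprint(energy_kwh, carbon_intensity_g_per_kwh):
    """Eq. (6): emissions in gCO2e = energy needed * carbon intensity."""
    return energy_kwh * carbon_intensity_g_per_kwh


# Hypothetical three-class softmax outputs and labels
probs = [[0.98, 0.01, 0.01], [0.99, 0.005, 0.005], [0.65, 0.25, 0.10], [0.15, 0.75, 0.10]]
labels = [0, 1, 0, 1]
msp = avg_msp(probs)
ece = expected_calibration_error(probs, labels)
emissions = carbon_footprint(0.002, 475.0)  # hypothetical kWh and gCO2e/kWh
print(f"Avg MSP={msp:.4f} ECE={ece:.4f} emissions={emissions:.2f} gCO2e")
```

In this toy set the second prediction is confidently wrong, which raises the ECE even though the Avg MSP stays high; this is the overconfidence pattern that the calibration analysis is meant to expose.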

4 Experiments and Results

This section presents all the experiments and evaluations in this study, including the evaluation of the ViT model, exploration of the pruning search space, highlighting efficiency enhancements, and investigation of robustness against attacks.

4.1 Experiment One

In this experiment, the lightest architecture that does not degrade accuracy by more than 2% was selected. The results demonstrated that the 6-block model (six blocks pruned) was the optimal model, balancing performance and efficiency, and it was therefore selected as the GreenShield-ViT architecture. The 4-block model is the lightest, but its accuracy drops from 96.2% to 89.9% (a 6.3% reduction), whereas the accuracy of the 6-block model drops from 96.2% to 95.6% (a 0.6% reduction), which is within the 2% limit. The 8-block model achieved strong performance, but the 6-block model achieved competitive performance with a significant enhancement in efficiency. The block-importance ranking reveals that Transformer layers contribute unequally: lower-ranked blocks provide limited information and can be removed. Pruning up to 6 blocks maintains performance, whereas further pruning eliminates critical feature representations and degrades performance. All models were evaluated under identical experimental conditions. All experiments were conducted in a Kaggle notebook using an NVIDIA Tesla P100 GPU, with the same software framework (PyTorch and TIMM), identical input resolution (224 × 224), and the same test dataset. These standardized settings provide unbiased comparisons across models. The block importance was estimated on 30 batches from the validation set. The top-ranked blocks remain consistent when 30 or 40 batches are used, indicating that the importance estimation is stable; based on this observation, 30 batches were selected. Table 3 demonstrates the stability of the ranking. Table 4 presents the results of the pruning search space, and Fig. 9 shows the efficiency enhancement. Fig. 10 shows the ROC curve. The t-SNE plot (Fig. 11) shows that class separability is preserved after pruning, as the clusters remain clearly separated. In addition, the attention rollout visualization (Fig. 12) confirms that the model focuses on similar clinically relevant regions before and after pruning. Tables 5 and 6 demonstrate consistent performance across five trials.
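The selection rule used in this experiment (take the lightest configuration whose accuracy drop stays within 2%) can be sketched as follows. Only the accuracies quoted in the text are used: 96.2% before pruning, 95.6% for 6 blocks, and 89.9% for 4 blocks. The 8-block accuracy is reported in Table 4 and is deliberately left out here, and the function name is hypothetical.

```python
def select_lightest(baseline_accuracy, candidates, max_drop=2.0):
    """Return the fewest-block candidate whose accuracy drop is at most max_drop points."""
    admissible = [
        blocks
        for blocks, accuracy in candidates.items()
        if baseline_accuracy - accuracy <= max_drop
    ]
    if not admissible:
        return None
    return min(admissible)  # fewest remaining transformer blocks


baseline = 96.2                   # accuracy before pruning, as quoted in the text (%)
candidates = {6: 95.6, 4: 89.9}   # blocks kept -> accuracy (%), as quoted in the text
chosen = select_lightest(baseline, candidates)
print(f"selected configuration: {chosen} blocks")
```

The 4-block model is rejected because its 6.3-point drop exceeds the threshold, so the rule returns the 6-block configuration, in agreement with the choice reported above.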

Figure 9: Comparison of computational complexity between GreenShield-ViT and ViT/B-16.

Figure 10: ROC curve for the proposed GreenShield-ViT model.

Figure 11: t-SNE plot showing that the clusters remain clearly separated after pruning.

Figure 12: Attention-rollout visualization to show attention regions on both GreenShield-ViT and ViT/B-16 models.

The results indicate that reducing the number of transformer blocks does not significantly degrade performance until a certain depth is reached. This is likely because earlier transformer layers capture the most critical structural and textural features of retinal OCT images, while deeper layers provide only marginal refinements. Given the structured nature of OCT data, redundant deeper representations can be removed without sacrificing discriminative capability. This explains why the 6-block configuration achieves a near-optimal balance between efficiency and accuracy, while the 8-block configuration offers minimal additional performance gains at a higher computational cost.

4.2 Experiment Two

This experiment investigated the robustness of the selected pruned model (6-Block) against adversarial attacks. Adversarial training was applied to the pruned model (GreenShield-ViT). This model represents the proposed approach, including all phases in this experiment. This experiment included the results before and after the proposed adversarial training to demonstrate the enhancement. The results indicated that the proposed adversarial training enhanced the robustness and calibration of the model. Before applying adversarial training, the model was unstable, achieving low accuracy with a high confidence rate. Table 7 shows the enhancement in the robustness. The calibration enhancements are demonstrated in Table 8 using the average of the maximum softmax probability (Avg. MSP) and expected calibration error (ECE). Fig. 13 illustrates the alignment between the accuracy and average MSP before and after adversarial training. As shown in Tables 9 and 10, the model demonstrates consistent performance across all trials, with minimal variation in both clean and adversarial accuracy, confirming its stability.

Figure 13: Alignment between model accuracy and confidence (Avg. MSP): (a) before applying adversarial training; (b) strong alignment after applying adversarial training.

To address potential concerns regarding overfitting and result saturation, we conducted five independent experimental trials using different random seeds. The results, summarized in Tables 7 and 8, report mean, standard deviation, and 95% confidence intervals for all evaluation metrics. The observed low variance across trials confirms the stability and generalization capability of the proposed GreenShield-ViT model, even under adversarial conditions.
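The per-metric summary over five seeded trials (mean, standard deviation, and 95% confidence interval) can be computed as in the sketch below. The text does not state which interval was used, so a two-sided Student-t interval with 4 degrees of freedom (critical value 2.776) is assumed here; the accuracy values are hypothetical placeholders and not the results in the tables.

```python
import math
import statistics

T_CRITICAL_DF4 = 2.776  # two-sided 95% Student-t quantile for 5 trials (4 degrees of freedom)


def summarize(values, t_critical=T_CRITICAL_DF4):
    """Return mean, sample standard deviation, and a t-based confidence interval."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    half_width = t_critical * std / math.sqrt(len(values))
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical clean accuracies (%) for seeds 42, 123, 2024, 3407, and 9999
accuracies = [95.0, 95.5, 96.0, 95.5, 95.5]
mean, std, (low, high) = summarize(accuracies)
print(f"mean={mean:.2f} std={std:.3f} 95% CI=[{low:.2f}, {high:.2f}]")
```

With only five trials the t-based interval is noticeably wider than a normal-approximation interval would be, which is worth keeping in mind when interpreting low variance across seeds.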

4.3 Experiment Three

This experiment evaluates the robustness of the baselines (Mobile-ViT, ViT-Tiny, and ViT-Small) against adversarial attacks (FGSM, PGD, BIM, and the hybrid attack). The robustness performance is evaluated to compare the baselines with the proposed model and to identify the most robust lightweight model. The same proposed pipeline was applied to all models. Exploring the robustness of these lightweight models addresses a research gap in prior OCT work. To ensure a fair comparison, all models were trained using identical data splits, preprocessing steps, hyperparameter settings, and adversarial training configuration, and were evaluated under the same attack conditions. The results demonstrated that the baselines achieved high accuracy on clean samples but exhibited a significant robustness gap under attack. Furthermore, larger models are not necessarily more robust: ViT-Small performed well against PGD and BIM samples, which were included in training, but performed poorly against FGSM and the hybrid attack, which indicates attack-specific overfitting and limited generalization. Table 11 presents the results of the baseline models, and Fig. 14 illustrates that the proposed model is significantly more robust than the baseline models, based on the success rate of each attack.

Figure 14: Comparison of robustness between GreenShield-ViT and baselines.

4.4 Experiment Four

In this experiment, the model was fine-tuned on a mix of the OCTID and OCT-C8 datasets to help it adapt to domain shift. After fine-tuning, the model was evaluated on three datasets: OCT-C8 (15% test set images), OCTID (115 images), and UCSD-3 (735 images). The results indicate that the proposed model achieves a high clean accuracy. The results on the OCTID dataset indicate that the model adapted to domain shift, while the results on UCSD-3 indicate that the model can generalize to unseen data. The significant improvement in robustness after adversarial training demonstrates that the proposed pipeline builds a robust model. In addition, the findings illustrate a strong alignment between the accuracy and confidence levels after adversarial training. Table 12 lists the experimental results, while Tables 13–15 present the confusion matrices for the different datasets.

4.5 Experiment Five

This experiment evaluates the proposed model at varying epsilon levels to verify that it generalizes beyond the default evaluation setting (the ImageNet default of 8/255). The results demonstrate that the model achieves competitive results under standard adversarial benchmark settings (2/255, 4/255, and 8/255), while maintaining acceptable robustness at 16/255, which exceeds commonly adopted research benchmarks. The model was also evaluated against the PGD-R3 (three random restarts) and Transfer-PGD scenarios. The experimental results are listed in Table 16. Fig. 15 shows the loss level at different epsilon values. The results indicate that the model achieves satisfactory performance in the different scenarios and does not exhibit gradient masking, as increasing the perturbation magnitude consistently leads to a higher loss and lower accuracy.
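The gradient-masking argument above amounts to a monotonicity check: as ε grows, the loss should not decrease and the accuracy should not increase. A minimal sketch of that check follows. The epsilon levels are those used in this experiment, while the loss and accuracy values are hypothetical placeholders rather than the results in Table 16, and the function name is illustrative.

```python
def consistent_with_no_masking(epsilons, losses, accuracies):
    """True if loss is non-decreasing and accuracy is non-increasing as epsilon grows."""
    order = sorted(range(len(epsilons)), key=lambda i: epsilons[i])
    sorted_losses = [losses[i] for i in order]
    sorted_accs = [accuracies[i] for i in order]
    loss_ok = all(a <= b for a, b in zip(sorted_losses, sorted_losses[1:]))
    acc_ok = all(a >= b for a, b in zip(sorted_accs, sorted_accs[1:]))
    return loss_ok and acc_ok


epsilons = [2 / 255, 4 / 255, 8 / 255, 16 / 255]  # levels used in this experiment
losses = [0.40, 0.60, 0.90, 1.50]                 # hypothetical adversarial losses
accuracies = [90.0, 85.0, 78.0, 60.0]             # hypothetical adversarial accuracies (%)
result = consistent_with_no_masking(epsilons, losses, accuracies)
print(f"monotone degradation: {result}")
```

Passing this check is a necessary sign rather than a proof: a loss that plateaus or drops at larger ε would be a warning that the attack gradients are uninformative.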

Figure 15: GreenShield-ViT loss levels against different epsilon values.

5 Discussion

This study investigated the robustness of a lightweight ViT architecture. The proposed GreenShield framework enhanced model efficiency through gradient-based block-importance pruning, while robustness against white-box L-infinity attacks was enhanced through an integrated adversarial defense stage. The study’s contributions and findings are as follows:

• GreenShield is introduced as a novel, empirically validated framework that integrates gradient-based block-importance pruning with adversarial training for retinal OCT. It achieves a significant sustainability enhancement: FLOPs and model parameters were each reduced by 49.6%, inference memory usage (in MB) by 33.8%, inference time (in seconds) by 46.8%, model size (in MB) by 49.3%, and CO2 emissions by 47.1%. The overall success rate of all attacks was reduced across the three datasets (OCT-C8, UCSD-3, and OCTID).

• GreenShield-ViT outperformed existing lightweight ViT variants (Mobile-ViT, ViT-Small, and ViT-Tiny) in terms of robustness against all attacks (FGSM, PGD, BIM, and the hybrid attack). This supports reliable clinical decision-making and efficient integration into hospital systems.

• The experimental results indicate that adversarial robustness depends more on architectural design than on model size. ViT-Small, the largest baseline model, suffers from overfitting, while the proposed model, despite its even greater capacity, achieves better generalization and maintains stable performance.

• Adversarial training was enhanced by wrapping normalization into the model, which improved perturbation generation. This issue has rarely been examined in prior OCT studies.

• Model calibration was significantly improved, demonstrating a strong alignment between the accuracy and confidence levels under attack conditions (stable confidence levels).

• The proposed framework jointly improves both robustness and efficiency, addressing a research gap in retinal OCT studies where these objectives have rarely been investigated together.

The proposed adversarial training enhanced the overall robustness of all the models included in this study. This study focuses on retinal OCT disease classification and addresses several research gaps. Table 17 presents the gaps identified in previous studies and illustrates how they were addressed in the present study.

6 Limitations and Future Work

The proposed framework was evaluated only against a specific set of white-box attacks, namely FGSM, PGD, and BIM. This study can be extended to incorporate a larger set of attacks, such as L2 attacks (DeepFool, AutoAttack, and CW).

Hardware is a further limitation of this study: a P100 GPU with limited speed and compute budget was used, which did not allow additional attacks or pruning techniques to be investigated. This opens up future directions, such as incorporating iterative L2 attacks with more powerful GPUs such as the NVIDIA RTX 4090. The significant improvement against L-infinity attacks (results before and after adversarial training) illustrates that the included attacks are important and can significantly harm model performance, indicating that they should be considered. Future work could explore other pruning methods (such as magnitude pruning, Taylor pruning, and soft pruning) to enhance efficiency and produce more robust architectures, while exploring more adversarial attacks and scenarios.

7 Conclusion

This study introduced GreenShield, a novel framework that produces a lightweight Vision Transformer with competitive adversarial robustness compared to existing lightweight ViT variants (Mobile-ViT, ViT-Tiny, and ViT-Small). Major gaps in prior OCT studies are addressed, including sustainability, robustness against white-box L-infinity attacks, and the development of reliable preprocessing pipelines. The proposed approach demonstrated that a ViT model can be lightweight and robust without degrading generalization performance. The framework integrates gradient-based importance pruning for model compression with a robust adversarial training pipeline that incorporates correct ImageNet normalization and anti-saturation techniques to mitigate softmax saturation. This approach is supported by a strong evaluation plan, including FGSM, PGD, BIM, and the proposed hybrid FGSM-PGD attack. The findings demonstrated significant efficiency enhancements, reducing FLOPs, parameters, carbon footprint emissions, model size, inference time, and memory usage by around 45%–50% across multiple efficiency metrics, while preserving generalization performance. In addition, the robustness of the GreenShield-ViT model significantly exceeded that of other lightweight ViT variants across all adversarial scenarios. The model achieved a significant reduction in the attack success rate across three datasets (OCT-C8, OCTID, and UCSD-3), maintained strong accuracy under multiple adversarial attacks while preserving generalization, and achieved substantial improvements in calibration. This transforms the model from an overconfident system (with low accuracy) into a well-calibrated predictor. This paper presents a robust and lightweight model that can be extended to other medical imaging tasks. Future research can explore more pruning techniques to generate more efficient ViT architectures. In addition, this work can be extended to incorporate additional adversarial attacks, such as L2 black-box or white-box attacks.

Acknowledgement: The authors would like to acknowledge Kaggle for providing the computational environment (a GPU P100) used for training and evaluation, and the providers of the OCT datasets.

Funding Statement: The authors received no specific funding.

Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Munthir Qasaimeh, Mostafa Ali and Qasem Abu Al-Haija; Methodology, Munthir Qasaimeh, Mostafa Ali and Qasem Abu Al-Haija; Experiments and formal analysis, Munthir Qasaimeh; Writing—original draft, Munthir Qasaimeh; Writing—review & editing, Munthir Qasaimeh, Mostafa Ali and Qasem Abu Al-Haija; Visualization, Munthir Qasaimeh; Supervision, Mostafa Ali and Qasem Abu Al-Haija. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The data supporting the findings of this study are publicly available in the Kaggle repository. The Retinal OCT C-8 dataset is available at: https://www.kaggle.com/datasets/obulisainaren/retinal-oct-c8. The UCSD-3 OCT dataset is available at: https://www.kaggle.com/datasets/mmazizi/ucsd-3-class-labeled-retinal-oct-images. Source code is available at: https://github.com/muntherqasaimeh/VTwork.

Ethics Approval: Not applicable. This study used a publicly available dataset and did not involve data collection from human participants or animals.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Huang D, Swanson EA, Lin CP, Schuman JS, Stinson WG, Chang W, et al. Optical coherence tomography. Science. 1991;254(5035):1178–81. doi:10.1126/science.1957169.

2. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16 x 16 words: transformers for image recognition at scale. In: Proceedings of the International Conference on Learning Representations; 2021 May 3–7; Virtual.

3. Takahashi S, Sakaguchi Y, Kouno N, Takasawa K, Ishizu K, Akagi Y, et al. Comparison of vision transformers and convolutional neural networks in medical image analysis: a systematic review. J Med Syst. 2024;48(1):84. doi:10.1007/s10916-024-02105-8.

4. Gupta S, Kapoor M, Debnath SK. Challenges and risks of AI-enabled healthcare security. In: Artificial intelligence-enabled security for healthcare systems: safeguarding patient data and improving services. Cham, Switzerland: Springer; 2025. p. 101–12. doi:10.1007/978-3-031-82810-2_6.

5. Ahmed S, Al Arafat A, Najafi D, Mahmood A, Rizve MN, Al Nahian M, et al. DeepCompress-ViT: rethinking model compression to enhance efficiency of vision transformers at the edge. In: Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2025 Jun 10–17; Nashville, TN, USA. doi:10.1109/CVPR52734.2025.02806.

6. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. In: Proceedings of the 3rd International Conference on Learning Representations; 2015 May 7–9; San Diego, CA, USA.

7. Liew A, Agaian S. Comprehensive survey of OCT-based disorders diagnosis: from feature extraction methods to robust security frameworks. Bioengineering. 2025;12(9):914. doi:10.3390/bioengineering12090914.

8. Kanca E, Ayas S, Baykal Kablan E, Ekinci M. Evaluating and enhancing the robustness of vision transformers against adversarial attacks in medical imaging. Med Biol Eng Comput. 2025;63(3):673–90. doi:10.1007/s11517-024-03226-5.

9. Pavithra KC, Kumar P, Geetha M, Bhandary SV, Ajitha Shenoy KB, Rao G, et al. Transformer-based DME classification using retinal OCT images without data augmentation: an evaluation of ViT-B16 and ViT-B32 with optimizer impact. IEEE Access. 2025;13:180781–98. doi:10.1109/ACCESS.2025.3620945.

10. Yang Y, Cai Z, Qiu S, Xu P. A novel transformer model with multiple instance learning for diabetic retinopathy classification. IEEE Access. 2024;12:6768–76. doi:10.1109/ACCESS.2024.3351473.

11. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJ, Bottou L, Weinberger KQ, editors. Communications of the ACM. Santa Clara, CA, USA: Curran Associates, Inc.; 2017. p. 84–90.

12. Chen B, Deng W, Du J. Noisy softmax: improving the generalization ability of DCNN via postponing the early softmax saturation. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA. doi:10.1109/CVPR.2017.428.

13. Pearce T, Brintrup A, Zhu J. Understanding softmax confidence and uncertainty. arXiv:2106.04972. 2021.

14. Zhai S, Likhomanenko T, Littwin E, Busbridge D, Ramapuram J, Zhang Y, et al. Stabilizing transformer training by preventing attention entropy collapse. In: Proceedings of the 40th International Conference on Machine Learning; 2023 Jul 23–29; Honolulu, HI, USA.

15. Bharath Kumar DP, Kumar N, Dunston SD, Rajam VMA. Analysis of the impact of white box adversarial attacks in ResNet while classifying retinal fundus images. In: Owoc ML, Sicily FEV, Rajaram K, Balasundaram P, editors. Computational intelligence in data science. Berlin/Heidelberg, Germany: Springer; 2022. p. 162–75. doi:10.1007/978-3-031-16364-7_13.

16. Bhandari M, Shahi TB, Neupane A. Evaluating retinal disease diagnosis with an interpretable lightweight CNN model resistant to adversarial attacks. J Imag. 2023;9(10):219. doi:10.3390/jimaging9100219.

17. Umer MJ, Sharif M, Raza M, Kadry S. A deep feature fusion and selection-based retinal eye disease detection from OCT images. Expert Syst. 2023;40(6):e13232. doi:10.1111/exsy.13232.

18. Anvesh K, Reshmi BM, Hariharan S, Reddy HV, Krishnamoorthy M, Kukreja V, et al. A novel approach deep learning framework for automatic detection of diseases in retinal fundus images. Comput Model Eng Sci. 2025;143(2):1485–517. doi:10.32604/cmes.2025.063239.

19. Akça S, Garip Z, Ekinci E, Atban F. Automated classification of choroidal neovascularization, diabetic macular edema, and drusen from retinal OCT images using vision transformers: a comparative study. Lasers Med Sci. 2024;39(1):140. doi:10.1007/s10103-024-04089-w.

20. Pan H, Miao J, Yu J, Dong J, Zhang M, Wang X, et al. A lightweight model for the retinal disease classification using optical coherence tomography. Biomed Signal Process Control. 2025;101:107146. doi:10.1016/j.bspc.2024.107146.

21. Peng K, Huang D, Chen Y. Retinal OCT image classification based on MGR-GAN. Med Biol Eng Comput. 2025;63(6):1749–63. doi:10.1007/s11517-025-03286-1.

22. Palaniappan D, Tak TK, Vijayan K, Maram B, Kshirsagar PR, Ahmad N. Enhancement of medical imaging technique for diabetic retinopathy: realistic synthetic image generation using GenAI. Comput Model Eng Sci. 2025;145(3):4107–27. doi:10.32604/cmes.2025.073387.

23. Rahman S, Pal S, Fallah A, Doss R, Karmakar C. RAD-IoMT: robust adversarial defence mechanisms for IoMT medical image analysis. Ad Hoc Netw. 2025;178:103935. doi:10.1016/j.adhoc.2025.103935.

24. Naren OS. Retinal OCT image classification—C8 [Internet]. 2024 [cited 2026 Apr 6]. Available from: https://www.kaggle.com/dsv/9595300.

25. Gholami P, Roy P, Parthasarathy MK, Lakshminarayanan V. OCTID: optical coherence tomography image database. Comput Electr Eng. 2020;81:106532. doi:10.1016/j.compeleceng.2019.106532.

26. Kermany D, Zhang K, Goldbaum M. Large dataset of labeled optical coherence tomography (OCT) and chest X-ray images. Mendeley Data. 2018. doi:10.17632/rscbjbr9sj.3.

27. Wang H, Lin W, Manoranjan P, Xiao G, Chan KL, Wang X, et al. Image and video technology. 1st ed. Cham, Switzerland: Springer; 2023. doi:10.1007/978-3-031-26431-3.

28. Benz P, Ham S, Zhang C, Karjauv A, Kweon IS. Adversarial robustness comparison of vision transformer and MLP-mixer to CNNs. arXiv:2110.02797. 2021.

29. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. In: Proceedings of the 6th International Conference on Learning Representations 2018; 2018 Apr 30–May 3; Vancouver, BC, Canada.

30. Kurakin A, Goodfellow IJ, Bengio S. Adversarial examples in the physical world. In: Proceedings of the 5th International Conference on Learning Representations 2017; 2017 Apr 24–26; Toulon, France.

31. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning (ICML 2017); 2017 Aug 6–11; Sydney, NSW, Australia. p. 1321–30.

32. Zheng X, Jia J, Dong S, Wang Y, Lu R, Chen Y, et al. Training and inference time efficiency assessment framework for machine learning algorithms: a case study for hyperspectral image classification. Int J Appl Earth Obs Geoinf. 2025;141:104591. doi:10.1016/j.jag.2025.104591.

33. Yi C, Jian S, Tan Y, Zhang Y. HMO: host memory optimization for model inference acceleration on edge devices. In: Proceedings of the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC); 2024 Oct 6–10; Kuching, Malaysia. doi:10.1109/SMC54092.2024.10831215.

34. Lannelongue L, Grealey J, Inouye M. Green algorithms: quantifying the carbon footprint of computation. Adv Sci. 2021;8(12):2100707. doi:10.1002/advs.202100707.

35. Tang K, Lou T, He X, Shi Y, Zhu P, Gu Z. Enhancing adversarial robustness via anomaly-aware adversarial training. In: Knowledge science, engineering and management. Berlin/Heidelberg, Germany: Springer; 2023. p. 328–42. doi:10.1007/978-3-031-40283-8_28.

36. Xiao C, Zhu JY, Li B, He W, Liu M, Song D. Spatially transformed adversarial examples. In: Proceedings of the 6th International Conference on Learning Representations (ICLR 2018); 2018 Apr 30–May 3; Vancouver, BC, Canada.

Cite This Article

APA Style

Qasaimeh, M., Ali, M., Abu Al-Haija, Q. (2026). GreenShield: A Lightweight and Robust Vision Transformer Framework in Retinal Disease Classification. Computer Modeling in Engineering & Sciences, 147(2), 42. https://doi.org/10.32604/cmes.2026.080864

Vancouver Style

Qasaimeh M, Ali M, Abu Al-Haija Q. GreenShield: A Lightweight and Robust Vision Transformer Framework in Retinal Disease Classification. Comput Model Eng Sci. 2026;147(2):42. https://doi.org/10.32604/cmes.2026.080864

IEEE Style

M. Qasaimeh, M. Ali, and Q. Abu Al-Haija, “GreenShield: A Lightweight and Robust Vision Transformer Framework in Retinal Disease Classification,” Comput. Model. Eng. Sci., vol. 147, no. 2, pp. 42, 2026. https://doi.org/10.32604/cmes.2026.080864

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
