Open Access
ARTICLE
A Game-Theoretic Framework for Strategic Machine Unlearning in Backdoor Mitigation
Faculty of Data Science, City University of Macau, Macau, China
* Corresponding Author: Wenjian Liu. Email:
Computers, Materials & Continua 2026, 88(2), 25 https://doi.org/10.32604/cmc.2025.072458
Received 27 August 2025; Accepted 29 September 2025; Issue published 15 June 2026
Abstract
Backdoor attacks pose a critical threat to the reliability and trustworthiness of machine learning models, as they allow adversaries to manipulate model behavior through the injection of malicious patterns during training. Existing defenses, such as data filtering, fine-tuning, and model pruning, often lack provable guarantees or require retraining from scratch, resulting in significant computational costs. In this work, we propose GTMU (Game-Theoretic Machine Unlearning), a novel backdoor removal framework that formulates the unlearning process as a repeated game between the defender and a virtual attacker. The defender aims to strategically remove poisoned contributions while preserving benign knowledge, whereas the virtual attacker attempts to maintain the backdoor’s effectiveness. We introduce a Stackelberg game formulation to determine optimal unlearning policies and integrate a Nash equilibrium-based update rule to balance model utility and security. Our method leverages influence function approximations to estimate per-sample contribution and employs a regret-minimization strategy to adaptively select unlearning candidates. Experimental evaluations on image classification benchmarks under various backdoor settings demonstrate that GTMU consistently achieves over 95% clean accuracy while reducing backdoor success rates to below 2%, outperforming state-of-the-art backdoor defense methods in both efficiency and robustness. The proposed approach offers a theoretically grounded and computationally efficient solution for secure model deployment in adversarial environments.
Machine learning (ML) has witnessed unprecedented advancements over the past decade, enabling breakthroughs in diverse domains such as computer vision, natural language processing, autonomous systems, and healthcare diagnostics [1]. The deployment of ML models in safety-critical applications, however, has been accompanied by growing concerns about their robustness and security in adversarial environments. One of the most insidious and potent threats to the integrity of ML systems is the backdoor attack. In such an attack, an adversary injects carefully crafted poisoned samples into the training process, embedding a hidden malicious behavior that is only activated when a specific trigger pattern appears in the input. This allows the model to perform normally on clean data but misbehave in a highly predictable manner when the trigger is present, often redirecting predictions to a target label chosen by the attacker.
Backdoor attacks are particularly challenging to defend against because they exploit the same generalization capability that makes ML models powerful: the ability to learn from limited and diverse data. Even a small fraction of poisoned samples can be sufficient to implant a highly effective backdoor, especially in high-capacity models such as deep neural networks. The stealthy nature of these attacks means that models often achieve high accuracy on clean validation data, misleading conventional performance metrics and evading naive detection mechanisms.
A range of defense strategies has been proposed in the literature. Data-level defenses attempt to detect and remove poisoned training examples by analyzing statistical anomalies or reverse-engineering triggers. Model-level defenses focus on fine-tuning, pruning, or re-initializing parts of the model to weaken the backdoor functionality. Input-level defenses preprocess incoming samples to distort or neutralize triggers before they are fed to the model. While each of these categories offers valuable insights, they suffer from important limitations: data-level methods often require access to the full training set and risk removing benign samples; model-level approaches can inadvertently degrade clean accuracy and require significant retraining; input-level defenses are typically reactive and may fail against adaptive attacks. Furthermore, many existing defenses operate under strong assumptions, such as knowledge of the trigger pattern or the availability of extensive auxiliary datasets, which are unrealistic in many real-world scenarios.
Machine unlearning offers a fundamentally different perspective. Originating from privacy-driven requirements such as the GDPR’s “right to be forgotten,” unlearning techniques enable the targeted removal of the influence of specific training samples from an already-trained model without the need to retrain from scratch. This paradigm is particularly well-suited for mitigating backdoor attacks, as it allows defenders to surgically remove suspected poisoned contributions while retaining the majority of benign knowledge. However, directly applying existing unlearning algorithms to adversarially poisoned data is nontrivial. Without careful design, naive unlearning can erase important benign knowledge, destabilize model representations, and fail to fully remove the backdoor.
In this paper, we introduce a game-theoretic framework for machine unlearning tailored to backdoor removal. We formalize the interaction between the defender, who aims to identify and remove the influence of poisoned data while preserving clean accuracy, and a virtual attacker, who seeks to maximize the persistence of the backdoor after unlearning. This formulation captures the inherent strategic nature of the problem, where both sides adapt to each other’s moves. Our approach, termed GTMU (Game-Theoretic Machine Unlearning), combines the predictive power of influence function approximations with a regret-minimization strategy to identify high-impact poisoned contributions and remove them with minimal collateral damage to benign knowledge. By leveraging Stackelberg game principles to anticipate the attacker’s responses and Nash equilibrium conditions to balance competing objectives, GTMU provides a principled method for robust unlearning.
Our main contributions are as follows:
1. We formulate machine unlearning for backdoor mitigation within a game-theoretic framework, explicitly modeling the adaptive interplay between defender and attacker strategies. While game-theoretic approaches have been explored in adversarial ML, to our knowledge this is the first work applying them directly to the unlearning problem.
2. We propose the GTMU algorithm, which integrates influence function-based sample scoring with regret-minimization dynamics to efficiently select unlearning candidates under computational constraints.
3. We provide a theoretical analysis of the convergence and equilibrium properties of our method, offering guarantees on the trade-off between backdoor removal effectiveness and preservation of clean accuracy.
4. We conduct extensive experiments on benchmark datasets such as CIFAR-10, GTSRB, and ImageNet-Subset under multiple backdoor scenarios, demonstrating that GTMU consistently reduces backdoor success rates to below 2% while maintaining above 95% clean accuracy, outperforming state-of-the-art defenses in both performance and efficiency.
The remainder of this paper is organized as follows. Section 2 reviews prior research in backdoor defense, unlearning, and adversarial game theory. Section 3 introduces the fundamental concepts of machine unlearning, backdoor attacks, and game theory necessary for understanding our framework. Section 4 details the design of GTMU, including its influence-based scoring mechanism and game-theoretic optimization. Section 5 presents our experimental setup, results, and comparative analysis. Finally, Section 6 concludes with a discussion of potential extensions and broader implications.
Our work lies at the intersection of three key research areas: backdoor attacks and defenses, machine unlearning, and game-theoretic approaches to adversarial machine learning. In this section, we review representative and influential contributions in each area.
Backdoor Attacks and Defenses. Backdoor attacks, first popularized by the BadNets framework [2,3], embed a hidden malicious behavior into a trained model by injecting poisoned samples with a fixed trigger pattern into the training data. Since then, various attack strategies have been proposed to increase stealthiness and robustness. The Blend attack [4] hides the trigger by blending it into the entire image at low opacity, making detection harder. TrojanNN [5] learns an adaptive trigger jointly with model parameters, significantly improving attack persistence. Other notable attacks include invisible perturbation-based triggers [6], input-agnostic triggers [7], and sample-specific triggers [8], all of which pose unique challenges for defenses.
Defensive strategies against backdoors fall into three main categories. Data-level defenses detect and filter poisoned samples, often using statistical anomaly detection [9,10] or reverse-engineering triggers [11–13]. Model-level defenses modify the model to weaken backdoor activation, such as neuron pruning [14], fine-tuning [15], or parameter regularization [16]. Input-level defenses preprocess inputs to disrupt triggers, e.g., via transformations [17] or adversarial perturbations [18]. While effective in some settings, many defenses require access to the full training dataset, incur significant retraining costs, or cause non-negligible accuracy degradation on clean data.
Machine Unlearning. Machine unlearning, initially motivated by privacy regulations such as the GDPR [19], focuses on removing the influence of specific training data without retraining from scratch [20,21]. Early approaches include SISA training [22,23], which partitions data and models to allow efficient retraining of affected shards, and exact unlearning via retraining [24,25]. More recent work explores approximate unlearning using influence functions [26,27], gradient updates [28], and variational methods [29]. Unlearning has also been applied to federated learning [30,31] and continual learning [32], but most existing methods focus on privacy compliance rather than adversarial robustness, leaving the potential for targeted backdoor removal underexplored.
Game-Theoretic Approaches to Adversarial ML. Game theory provides a principled framework for modeling interactions between defenders and adversaries in machine learning. Prior work has applied Stackelberg games to model poisoning and evasion attacks [33,34], Nash equilibrium analysis for robust training [35,36], and zero-sum games for adversarial example generation [37,38]. In backdoor defense, game-theoretic thinking has been used implicitly in adaptive training strategies [39,40] and explicitly in multi-agent defense formulations [41,42]. However, to our knowledge, no prior work has framed backdoor removal via machine unlearning as a repeated game, nor combined influence-based targeting with regret-minimized adaptation in a formal game-theoretic setting.
Positioning of Our Work. Our GTMU framework builds on the influence-based unlearning literature [26,27] and integrates ideas from repeated game analysis [43,44] to design a dynamic defense that adapts to attacker persistence strategies. Compared to traditional backdoor defenses [9,11,14] and unlearning methods [22,28], GTMU uniquely models the defense as a Stackelberg game, enabling proactive rather than purely reactive mitigation. This strategic formulation, combined with efficient influence-based sample removal, allows GTMU to achieve both high effectiveness in backdoor suppression and minimal harm to clean accuracy, even in large-scale and adaptive threat settings.
In this section, we formalize the threat model, introduce the notations used throughout the paper, and briefly review the theoretical foundations of machine unlearning, backdoor attacks, and relevant concepts from game theory.
Let
where
We denote the set of poisoned samples by
In the backdoor threat model considered here, an adversary injects a small subset of poisoned samples into the training data. A poisoned sample
where
Machine unlearning aims to remove the influence of a specific subset
with respect to model predictions and generalization performance. The key challenge lies in achieving this approximation with minimal computational overhead while ensuring complete removal of
Influence functions approximate the effect of removing a training point
where
We model the backdoor removal process as a two-player game:
• Defender: selects a set of samples to unlearn with the goal of minimizing BSR while keeping CA high.
• Virtual Attacker: anticipates the defender’s strategy and attempts to maximize the residual BSR after unlearning.
The defender acts as the leader in a Stackelberg game, committing to an unlearning policy
Imagine a two-player game. The defender picks an unlearning budget: Small or Large. The attacker picks a trigger strength: Low or High. Think of the defender’s payoff as “clean accuracy minus compute cost” and the attacker’s as “backdoor success.” If the defender chooses Small, the attacker prefers High (the stronger attack wins); if the defender chooses Large, the attacker prefers Low (strong attacks no longer pay off). In a simultaneous-move (Nash) game, the best responses cross and no pure-strategy Nash equilibrium exists, so both sides hedge with mixed strategies. Intuitively, when you cannot commit first, you act cautiously because the other side might go harder.
Now switch to a Stackelberg (leader–follower) game where the defender commits first and the attacker reacts. Looking ahead, the defender knows that Small will invite High (bad outcome), while Large will induce Low (better outcome overall). So the defender commits to Large to shape the attacker’s best reply. This mirrors GTMU: by choosing a stronger unlearning move (or budget) up front—guided by influence scores and simple regret updates—the defender makes high-intensity triggers less attractive, improving the final trade-off between attack success and accuracy compared with the simultaneous-play baseline.
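The toy game above can be checked numerically. The following is a minimal sketch; the payoff numbers are hypothetical and were chosen only so that the best responses match the story (Small invites High, Large induces Low, and the defender would prefer Small if the attacker stayed Low). They are not values from our experiments.

```python
from itertools import product

DEFENDER_ACTIONS = ("Small", "Large")   # unlearning budget
ATTACKER_ACTIONS = ("Low", "High")      # trigger strength

# Hypothetical payoffs (defender, attacker), chosen only to match the story above.
PAYOFFS = {
    ("Small", "Low"): (3, 1),
    ("Small", "High"): (0, 3),
    ("Large", "Low"): (2, 2),
    ("Large", "High"): (1, 0),
}


def attacker_best_response(defender_action):
    return max(ATTACKER_ACTIONS, key=lambda a: PAYOFFS[(defender_action, a)][1])


def defender_best_response(attacker_action):
    return max(DEFENDER_ACTIONS, key=lambda d: PAYOFFS[(d, attacker_action)][0])


def pure_nash_equilibria():
    """Profiles where each action is a best response to the other."""
    return [
        (d, a)
        for d, a in product(DEFENDER_ACTIONS, ATTACKER_ACTIONS)
        if defender_best_response(a) == d and attacker_best_response(d) == a
    ]


def stackelberg_commitment():
    """Defender commits first, anticipating the attacker's best reply."""
    best = max(
        DEFENDER_ACTIONS,
        key=lambda d: PAYOFFS[(d, attacker_best_response(d))][0],
    )
    reply = attacker_best_response(best)
    return best, reply, PAYOFFS[(best, reply)][0]


def mixed_nash_defender_value():
    """Defender's expected payoff at the mixed equilibrium of this 2x2 game."""
    d_sl, d_sh = PAYOFFS[("Small", "Low")][0], PAYOFFS[("Small", "High")][0]
    d_ll, d_lh = PAYOFFS[("Large", "Low")][0], PAYOFFS[("Large", "High")][0]
    # Attacker plays Low with the probability q that makes the defender indifferent.
    q = (d_lh - d_sh) / (d_sl - d_sh - d_ll + d_lh)
    return q * d_sl + (1 - q) * d_sh
```

With these illustrative numbers there is no pure-strategy equilibrium, the mixed equilibrium gives the defender an expected payoff of 1.5, and committing to Large yields a payoff of 2, which is the advantage of moving first described above.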
We now introduce GTMU (Game-Theoretic Machine Unlearning), our proposed framework for strategically removing backdoors from trained machine learning models. The method formulates backdoor mitigation as a repeated game between a defender and a virtual attacker, combining influence function analysis with regret-minimization to identify and unlearn poisoned data efficiently while preserving clean performance.
We begin by formalizing the setting of game-theoretic machine unlearning for backdoor mitigation. Let the original training dataset be denoted as
where
Given a trained model
We quantify the defender’s two main objectives as:
• Backdoor Success Rate (BSR): The fraction of trigger-embedded inputs classified as the attacker’s target label. The defender seeks to minimize this quantity.
• Clean Accuracy (CA): The classification accuracy on benign test inputs. The defender aims to keep this above a specified threshold
Formally, let
where
Similarly, the attacker’s utility is modeled as:
where
The defender and attacker play a repeated game over T rounds. In each round:
1. The defender selects
2. The virtual attacker responds with
3. The resulting BSR and CA are evaluated, and the defender updates
This setup captures the strategic nature of backdoor removal. An optimal unlearning strategy must anticipate and counteract the attacker’s adaptations. The interplay between
• A Stackelberg game, where the defender acts as the leader and the attacker as the follower,
• A simultaneous-move game, where both sides choose strategies without knowledge of the other’s immediate choice, leading to a Nash equilibrium.
In this work, we adopt the Stackelberg formulation, as it aligns naturally with the operational reality of defenses: defenders commit to an unlearning policy first, and attackers adapt afterward. This allows us to preemptively shape the attacker’s best response and design unlearning actions that are robust to the worst-case persistence strategies.
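The round structure can be illustrated with a small regret-minimization loop. This is a sketch under stated assumptions, not the exact GTMU update: the defender keeps a mixed policy over a handful of candidate unlearning actions and updates it with multiplicative weights (Hedge), while the virtual attacker best-responds to the current policy. The loss table, the learning rate `eta`, and the function names are hypothetical.

```python
import numpy as np


def hedge_update(weights, losses, eta):
    """Multiplicative-weights (Hedge) step: down-weight actions with high loss."""
    updated = weights * np.exp(-eta * losses)
    return updated / updated.sum()


def play_repeated_game(loss_table, rounds=200, eta=0.2):
    """loss_table[i, j] is the defender's loss for defender action i vs attacker action j."""
    n_defender = loss_table.shape[0]
    weights = np.full(n_defender, 1.0 / n_defender)
    average_policy = np.zeros(n_defender)
    for _ in range(rounds):
        average_policy += weights
        # Virtual attacker best-responds to the defender's current mixed policy.
        attacker_action = int(np.argmax(weights @ loss_table))
        weights = hedge_update(weights, loss_table[:, attacker_action], eta)
    return weights, average_policy / rounds


# Hypothetical losses: rows = defender policies, columns = attacker responses.
LOSS_TABLE = np.array([[0.9, 0.8], [0.2, 0.3], [0.6, 0.7]])
final_weights, average_policy = play_repeated_game(LOSS_TABLE)
```

In this toy table the second defender action has the lowest loss against every attacker response, so the weights concentrate on it; in a real run the losses would come from the measured BSR and CA of each round.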
4.2 Influence-Based Candidate Selection
A core challenge in machine unlearning for backdoor mitigation is determining which training samples to target for removal. Since retraining from scratch is computationally prohibitive, we require a principled mechanism to identify the subset of samples whose removal will most effectively disrupt the backdoor while minimally harming benign model behavior. To achieve this, we leverage the theory of influence functions, which approximate the effect of individual training samples on model predictions by analyzing the model’s loss landscape.
Let
where:
•
•
•
A higher value of
where K is a truncation parameter controlling the approximation depth. This enables efficient computation of influence scores for large-scale models.
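To make the role of the truncation parameter concrete, the sketch below approximates an inverse-Hessian-vector product with a K-step Neumann recursion in the style of LiSSA. It assumes access to a Hessian-vector product routine `hvp` and a scaling constant `scale` that upper-bounds the largest Hessian eigenvalue; both names, and the small demo matrix, are illustrative assumptions rather than part of the method's specification.

```python
import numpy as np


def truncated_inverse_hvp(hvp, v, K, scale):
    """Approximate H^{-1} v with a K-step truncated Neumann recursion.

    hvp(u) must return H @ u; scale must upper-bound the largest eigenvalue
    of H so that the series converges.
    """
    estimate = v.copy()
    for _ in range(K):
        # h_{k+1} = v + (I - H / scale) h_k
        estimate = v + estimate - hvp(estimate) / scale
    return estimate / scale


# Small symmetric positive-definite stand-in for a (damped) Hessian.
H_DEMO = np.array([[2.0, 0.5, 0.0], [0.5, 1.5, 0.2], [0.0, 0.2, 1.0]])
V_DEMO = np.array([1.0, -1.0, 0.5])
exact = np.linalg.solve(H_DEMO, V_DEMO)
coarse = truncated_inverse_hvp(lambda u: H_DEMO @ u, V_DEMO, K=5, scale=2.5)
fine = truncated_inverse_hvp(lambda u: H_DEMO @ u, V_DEMO, K=100, scale=2.5)
```

Larger K trades extra Hessian-vector products for a smaller approximation error: on the demo matrix, the K=100 estimate is much closer to the exact solve than the K=5 estimate.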
Once the influence scores
where
Integration with Game-Theoretic Strategy
The influence scores serve as a prior for the defender’s unlearning policy
By systematically identifying high-impact samples via influence functions, we ensure that the unlearning process remains both targeted (focusing on poisoned data) and efficient (minimizing computational overhead), laying the foundation for the adaptive, game-theoretic updates that follow.
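As a simple illustration of score-based targeting, the sketch below picks the highest-scoring samples under a fixed budget. The plain top-`budget` rule and the function name are assumptions made for illustration; the actual selection in GTMU additionally interacts with the game-theoretic policy.

```python
import numpy as np


def select_unlearning_candidates(scores, budget):
    """Return indices of the `budget` highest-scoring samples, best first."""
    scores = np.asarray(scores, dtype=float)
    budget = min(budget, scores.size)
    if budget <= 0:
        return np.array([], dtype=int)
    # argpartition finds the top `budget` entries without a full sort.
    top = np.argpartition(-scores, budget - 1)[:budget]
    return top[np.argsort(-scores[top])]
```

For example, `select_unlearning_candidates([0.1, 0.9, 0.3, 0.7, 0.05], budget=2)` returns the indices of the two largest scores in descending order of score.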
After identifying a candidate set of samples
From Full Retraining to Approximate Updates
Let
in terms of prediction behavior. Using influence functions [26], the parameter difference caused by removing
This expression provides a first-order correction that moves the parameters in the opposite direction of the gradient contributions from the removed samples.
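The first-order correction can be sanity-checked on a model where retraining is cheap. The sketch below uses ridge regression as a stand-in: it compares the influence-style update against exact retraining without the removed samples. The sign and the 1/n scaling follow the usual leave-out influence convention and are an assumption here, as are the synthetic data, the label corruption used as a proxy for poisoning, and all function names.

```python
import numpy as np


def fit_ridge(X, y, lam, n_norm):
    """Minimize (1/n_norm) * sum 0.5 * (x.theta - y)^2 + (lam/2) * ||theta||^2."""
    d = X.shape[1]
    H = X.T @ X / n_norm + lam * np.eye(d)
    return np.linalg.solve(H, X.T @ y / n_norm)


def influence_unlearn(theta, X, y, lam, remove_idx):
    """First-order removal: theta + (1/n) * H^{-1} * sum of removed gradients."""
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)
    residual = X[remove_idx] @ theta - y[remove_idx]
    grad_sum = X[remove_idx].T @ residual  # sum of per-sample loss gradients
    return theta + np.linalg.solve(H, grad_sum) / n


rng = np.random.default_rng(0)
n, d, lam = 500, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
remove_idx = np.arange(5)
y[remove_idx] += 3.0  # corrupted labels as a crude proxy for poisoned samples

theta_full = fit_ridge(X, y, lam, n)
keep = np.setdiff1d(np.arange(n), remove_idx)
# Retrain on the kept samples with the same 1/n normalization as the full fit.
theta_retrained = fit_ridge(X[keep], y[keep], lam, n)
theta_unlearned = influence_unlearn(theta_full, X, y, lam, remove_idx)

err_before = np.linalg.norm(theta_full - theta_retrained)
err_after = np.linalg.norm(theta_unlearned - theta_retrained)
```

Because the loss is quadratic, the only approximation is that the Hessian is evaluated with the removed samples still present, so the unlearned parameters land much closer to the retrained ones than the original parameters do. For deep networks the same update is only a local approximation.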
Computing the exact inverse Hessian
Let
where (i)
A crucial consideration is that the removal of poisoned samples should not disproportionately degrade benign performance. To achieve this, we monitor the clean accuracy after each unlearning step and introduce a corrective term if CA falls below the threshold
where
The unlearning update rule is applied repeatedly over T iterations, each time with a newly selected
We formalize guarantees for GTMU under a local, regularized view of the training landscape and an inexact, damped, preconditioned second-order update with residual control. These statements are intended as practical, local diagnostics, not as global convergence claims for deep nonconvex networks.
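For orientation, the kind of update analyzed in this section can be sketched as a damped Newton step whose linear system is solved inexactly by conjugate gradients and whose step is accepted by Armijo backtracking. The sketch is illustrative only: it omits the preconditioner, the damping value, tolerances, and function names are assumptions, and rejecting the step by leaving the parameters unchanged is one possible safeguard rather than the prescribed one.

```python
import numpy as np


def conjugate_gradient(matvec, b, tol=1e-6, max_iter=100):
    """Solve matvec(x) = b until the relative residual falls below tol."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    if b_norm == 0.0:
        return x
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def damped_newton_step(loss, grad, hvp, theta, damping=1e-2, c=1e-4,
                       shrink=0.5, max_backtracks=20):
    """One inexact damped Newton step with Armijo backtracking."""
    g = grad(theta)
    # Inexactly solve (H + damping * I) p = -g.
    direction = conjugate_gradient(lambda v: hvp(theta, v) + damping * v, -g)
    slope = g @ direction
    base = loss(theta)
    step = 1.0
    for _ in range(max_backtracks):
        candidate = theta + step * direction
        if loss(candidate) <= base + c * step * slope:  # Armijo condition
            return candidate, True
        step *= shrink
    return theta, False  # step rejected: keep the current parameters


# Quadratic toy objective standing in for a local surrogate loss.
A_DEMO = np.array([[2.0, 0.5, 0.0], [0.5, 1.5, 0.2], [0.0, 0.2, 1.0]])
B_DEMO = np.array([1.0, -2.0, 0.5])


def demo_loss(theta):
    return 0.5 * theta @ A_DEMO @ theta - B_DEMO @ theta


def demo_grad(theta):
    return A_DEMO @ theta - B_DEMO


def demo_hvp(theta, v):
    return A_DEMO @ v


theta = np.zeros(3)
accepted_flags = []
for _ in range(3):
    theta, accepted = damped_newton_step(demo_loss, demo_grad, demo_hvp, theta)
    accepted_flags.append(accepted)
```

On the quadratic toy problem every step is accepted at full length and the gradient norm shrinks by roughly the ratio of the damping to the curvature at each step.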
Notation.
Let
Assumptions.
We work in a neighborhood
A1 (Local smoothness)
A2 (Damped positive definiteness)
A3 (Preconditioner)
A4 (Inexact solve)
A5 (Safeguard) Step acceptance uses Armijo backtracking or a trust region; if not accepted,
A6 (Influence oracle) Influence scores
GTMU uses
where
which is
4.4.1 Descent and Stability of the Second-Order Step
Theorem 1 (Surrogate descent under inexact damped Newton): Under A1–A5, any accepted step of (14) satisfies
for some Armijo constant
Proof sketch. The Armijo condition guarantees a quadratic decrease in the
Corollary 1 (Bounded steps and iterate stability): If
4.4.2 Quality of the Inexact Solve and the Role of Preconditioning
Theorem 2 (CG/LiSSA rate with preconditioning): Let
Equivalently, the residual criterion in A4 implies
Proof sketch. Follows from classical PCG theory on the preconditioned normal equations and coercivity of
Theorem 3 (Linear rate in the quadratic case): If L is (locally) quadratic with Hessian
where
4.4.3 Selection Guarantees for Unlearning Candidates
Definition 1 (Weak submodularity): Let
Theorem 4 (Approximation under weak submodularity and oracle error): Suppose
where
Proof sketch. Standard analysis for weakly submodular maximization with noisy oracles, using telescoping marginal gains and the definition of
Proposition 1 (Sample complexity for noisy influence estimates): Assume each marginal gain is estimated by an average of N i.i.d. probes with variance proxy
Theorem 5 (No-regret
where
4.4.4 Robustness, Stability, and Generalization Effects
Proposition 2 (Prediction stability under bounded steps): Let the model’s logits be L-Lipschitz in parameters. If
Proposition 3 (Influence debiasing bound): Let
4.4.5 Putting the Pieces Together
Theorem 6 (Composite progress per round): Fix a round
Averaging over rounds with no-regret attacker dynamics yields the CCE guarantee in Theorem 5 with additional additive penalties from oracle noise and HVP inexactness.
We evaluate the effectiveness and efficiency of our proposed GTMU framework on multiple benchmark datasets and backdoor attack settings. Our primary objectives are to measure: (i) the reduction in Backdoor Success Rate (BSR) after unlearning, (ii) the preservation of Clean Accuracy (CA), and (iii) the computational efficiency compared to baseline methods. We also conduct ablation studies to assess the contributions of individual components such as influence-based selection and regret minimization.
In this subsection, we present the details of our experimental design, including the datasets, model architectures, backdoor attack implementations, baseline defense methods, and evaluation metrics. These choices were made to ensure that our evaluation of GTMU is both comprehensive and representative of realistic deployment environments in which backdoor threats may arise.
Datasets
We evaluate GTMU on three datasets that differ significantly in scale, complexity, and application domain, ensuring that our conclusions are not limited to a single data distribution.
• CIFAR-10 [45] is a small-scale but widely used dataset containing 60,000
• GTSRB [46] (German Traffic Sign Recognition Benchmark) contains 51,839 color images of 43 traffic sign categories, with significant intra-class variation due to weather, lighting, and viewing angles. The dataset is relevant for safety-critical systems such as autonomous driving, where backdoor vulnerabilities can have severe real-world consequences. We use the standard training/test split provided by the dataset.
• ImageNet-Subset [47] is a reduced-scale variant of the ImageNet dataset, containing 50 randomly chosen categories from the full 1000-class ImageNet benchmark. Each category contains roughly 1300 training images and 50 validation images. This subset allows us to test GTMU on large-scale, high-resolution data while keeping computational demands manageable.
Model Architectures
For CIFAR-10, we adopt a ResNet-18 backbone [48], which balances performance and computational efficiency. For GTSRB, we use a VGG-16 model [49], which has been widely used in traffic sign recognition research. For ImageNet-Subset, we employ a ResNet-50 architecture to handle the higher complexity and resolution of the images. All models are trained using standard data augmentation techniques (random cropping, horizontal flipping, and normalization) and optimized with stochastic gradient descent (SGD) with momentum.
Backdoor Attack Implementations
We consider three representative and widely studied backdoor attack types:
• BadNets [2]—a static trigger consisting of a small, fixed white square pattern placed in the lower-right corner of the image. This attack is easy to implement but highly effective.
• Blend [4]—a trigger blended into the entire image with a fixed transparency factor. This attack is harder to detect through simple pattern matching and can evade certain preprocessing defenses.
• TrojanNN [50]—an adaptive backdoor where the trigger pattern is learned jointly with the model parameters to maximize stealthiness and effectiveness.
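The two static triggers can be sketched in a few lines. This is an illustrative sketch only: images are assumed to be float arrays in [0, 1] with shape (H, W, C), and the patch size, blend factor, and poisoning helper are hypothetical choices rather than the exact settings used in our experiments. The learned TrojanNN trigger is not covered.

```python
import numpy as np


def apply_badnets_trigger(image, patch_size=3):
    """Stamp a white square into the lower-right corner of an HxWxC image."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:, :] = 1.0
    return poisoned


def apply_blend_trigger(image, trigger, alpha=0.1):
    """Blend a full-size trigger image into the input with transparency alpha."""
    return np.clip((1.0 - alpha) * image + alpha * trigger, 0.0, 1.0)


def poison_dataset(images, labels, target_label, rate, trigger_fn, rng):
    """Apply trigger_fn to a random fraction `rate` of samples and relabel them."""
    images, labels = images.copy(), labels.copy()
    n_poison = int(round(rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = trigger_fn(images[i])
    labels[idx] = target_label
    return images, labels, idx
```

A Blend-poisoned set can be built by passing, for example, `lambda img: apply_blend_trigger(img, trigger, alpha)` as `trigger_fn`, where `trigger` is an image of the same shape as the inputs.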
For all attacks, the poisoning rate is set to
Baselines
We compare GTMU against several state-of-the-art unlearning and backdoor defense methods:
• Fine-Pruning [14]—detects and prunes neurons highly activated by trigger patterns.
• FT-Unlearning [28]—fine-tunes the model using only clean data after removing specific samples.
• IF-Unlearning [27]—employs influence functions to guide the removal of training data contributions.
• Random Removal—randomly selects the same number of samples for removal as GTMU, serving as a control to measure the importance of targeted unlearning.
Evaluation Metrics
To assess the effectiveness of each method, we report:
• Clean Accuracy (CA): the classification accuracy on clean (benign) test images.
• Backdoor Success Rate (BSR): the fraction of trigger-embedded test images classified as the attacker’s target label, indicating the strength of the remaining backdoor.
• CA Drop: the absolute drop in clean accuracy relative to the poisoned model before unlearning.
• Time: the average wall-clock time per unlearning iteration, measuring computational efficiency.
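The accuracy-based metrics can be computed from predicted labels as in the following sketch. The function names are illustrative, and the optional exclusion of test samples whose true label already equals the target label is a common convention that we state here as an assumption.

```python
import numpy as np


def clean_accuracy(predictions, labels):
    """Fraction of clean test inputs classified correctly."""
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))


def backdoor_success_rate(triggered_predictions, target_label, true_labels=None):
    """Fraction of trigger-embedded inputs classified as the target label.

    If true_labels is given, samples that already belong to the target class
    are excluded (an assumed convention, not required by the definition).
    """
    preds = np.asarray(triggered_predictions)
    if true_labels is not None:
        preds = preds[np.asarray(true_labels) != target_label]
    if preds.size == 0:
        raise ValueError("no triggered samples to evaluate")
    return float(np.mean(preds == target_label))


def ca_drop(ca_before, ca_after):
    """Absolute drop in clean accuracy relative to the poisoned model."""
    return ca_before - ca_after
```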
Implementation Details
For GTMU, we set the number of unlearning iterations
In this subsection, we present an extensive quantitative evaluation of GTMU against a diverse set of baselines, covering multiple datasets, attack types, and model architectures. We assess three primary metrics: Clean Accuracy (CA), Backdoor Success Rate (BSR), and computational efficiency (average wall-clock time per unlearning iteration). Additionally, we provide graphical visualizations of CA–BSR trade-offs and runtime scaling to offer a more intuitive understanding of the results. The experiments confirm that GTMU delivers state-of-the-art performance in suppressing backdoors while maintaining high clean accuracy and reasonable computational overhead.
5.2.1 Overall Performance across Datasets
Table 1 provides a comprehensive comparison of GTMU and four representative baselines: Fine-Pruning [14], FT-Unlearning [28], IF-Unlearning [26,27], and Random Removal. For each dataset and attack type, the table reports CA, BSR, and the resulting CA drop relative to the poisoned model prior to unlearning.

GTMU consistently reduces BSR to under
5.2.2 CA-BSR Trade-off Visualization
To provide a clearer picture of how GTMU balances CA preservation and BSR reduction, Fig. 1 plots CA against BSR for all methods on CIFAR-10 under BadNets. GTMU occupies the top-left corner of the plot, indicating its dominance in both metrics. Baselines such as Fine-Pruning achieve low BSR but at the cost of significantly reduced CA, while IF-Unlearning and FT-Unlearning achieve moderate trade-offs but lag behind GTMU in BSR suppression.

Figure 1: CA–BSR trade-off for CIFAR-10 under BadNets. black circle: GTMU (1.7% BSR, 95.3% CA), red square: IF-Unlearning (3.2%, 94.9%), green triangle: FT-Unlearning (2.6%, 94.4%), orange diamond: Fine-Pruning (2.9%, 92.0%), black cross: Random Removal (7.5%, 93.1%)
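Table 2 presents runtime scaling with dataset size for GTMU and key baselines. We measure average wall-clock time per iteration for progressively larger subsets of ImageNet-Subset. Fig. 2 visualizes the scaling trend: GTMU is slightly faster than IF-Unlearning at every subset size (e.g., 23.7 s vs. 24.8 s at the smallest size and 46.3 s vs. 48.6 s at the largest) and consistently faster than FT-Unlearning, while achieving lower residual BSR (e.g., 1.7% vs. 3.2% for IF-Unlearning on CIFAR-10 under BadNets, Fig. 1).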


Figure 2: Runtime scaling with dataset size on ImageNet-Subset. black circle: GTMU (23.7, 33.6, 39.9, 46.3 s), red square: IF-Unlearning (24.8, 35.3, 41.7, 48.6 s), green triangle: FT-Unlearning (32.5, 46.1, 54.8, 61.7 s)
Summary of Key Findings
From these results, we observe that GTMU:
1. Consistently achieves the lowest BSR across all tested datasets and attack types, remaining below
2. Maintains high CA, with the CA drop always under
3. Operates with competitive efficiency, avoiding the high costs of retraining-heavy methods such as FT-Unlearning.
4. Scales well with dataset size and model complexity, making it suitable for large-scale deployment.
The combination of precision targeting through influence estimation and adaptive iteration via regret minimization is key to GTMU’s superior performance profile.
To better understand the dynamic behavior of GTMU during the unlearning process, we analyze its convergence properties in terms of both Clean Accuracy (CA) and Backdoor Success Rate (BSR) over multiple iterations. The goal of this analysis is to examine how quickly GTMU is able to suppress the backdoor and how stable the clean accuracy remains during successive unlearning steps.
We perform this analysis on the CIFAR-10 dataset under the BadNets attack, using a ResNet-18 model. The initial poisoned model exhibits a CA of
We observe that BSR drops sharply within the first three iterations, from
1. Rapid Backdoor Suppression: The steep early reduction in BSR suggests that a relatively small number of high-impact poisoned samples account for most of the backdoor’s effectiveness, and GTMU efficiently targets them.
2. Accuracy Preservation: The minimal CA fluctuation confirms that GTMU’s unlearning update rule effectively preserves benign decision boundaries, even while aggressively removing poisoned contributions.
The flat curves beyond iteration 3 also suggest that in practical deployments, GTMU can terminate early after the BSR falls below a desired threshold, saving computation without sacrificing defense quality.
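Early termination of this kind can be expressed as a thin wrapper around the unlearning loop. The sketch below is a generic illustration: `unlearn_step` and `evaluate_bsr` are hypothetical callables standing in for one GTMU iteration and a BSR measurement on held-out triggered data, and the threshold is a user-chosen value.

```python
def run_with_early_stopping(unlearn_step, evaluate_bsr, state, max_iters, bsr_threshold):
    """Iterate unlearning until BSR drops below the threshold or max_iters is reached."""
    history = []
    for _ in range(max_iters):
        state = unlearn_step(state)
        bsr = evaluate_bsr(state)
        history.append(bsr)
        if bsr < bsr_threshold:
            break  # the backdoor is suppressed enough; skip remaining iterations
    return state, history
```

The returned `history` records the BSR after each executed iteration, so its length shows how many iterations were actually needed.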
Per-Iteration Metrics and Sensitivity
Table 3 reports the exact CA and BSR values across the


Figure 3: Convergence of GTMU on CIFAR-10 (BadNets). black circles: CA (stays between
Early-Stopping Sensitivity
We also assess the impact of the per-round removal budget


Figure 4: Effect of per-iteration removal budget
To better understand the contributions of each component in GTMU, we conduct an extensive ablation study on both CIFAR-10 (BadNets attack) and GTSRB (Blend attack). This analysis isolates the impact of (i) influence-based selection and (ii) regret minimization, allowing us to quantify how much each design choice contributes to backdoor suppression and clean accuracy preservation. We construct the following variants:
• GTMU w/o Influence Selection: Replaces the influence-based scoring
• GTMU w/o Regret Minimization: Keeps influence-based scoring but removes the multiplicative weights adaptation, fixing
• Full GTMU: The complete framework integrating both influence-based selection and regret minimization.
For CIFAR-10, we use ResNet-18 under the BadNets attack with a poisoning rate of
• Clean Accuracy (CA): Accuracy on clean test data.
• Backdoor Success Rate (BSR): Fraction of trigger-embedded test inputs classified as the attacker’s target label.
• CA Drop: Difference in CA before and after unlearning.
Each result is averaged over three independent runs to reduce stochastic variance.
Tables 5 and 6 present the results. On CIFAR-10, removing influence selection increases BSR from
1. Influence-based selection is critical. Across both datasets, removing it causes BSR to more than triple compared to the full GTMU, confirming that precise identification of high-impact poisoned samples is central to successful backdoor removal.
2. Regret minimization enhances stability. While its absence does not catastrophically harm BSR, it leads to larger CA drops and slightly higher residual BSR, indicating that adaptive reweighting helps preserve benign performance while cleaning up lingering backdoor effects.
3. The combination is necessary for robustness. The full GTMU consistently delivers the lowest BSR and smallest CA drop, demonstrating the complementary benefits of both components.


Overall, the ablation study validates that both influence-based targeting and regret-minimized adaptation are indispensable for achieving the high performance observed in our main experiments.
We broaden our evaluation (adaptive attacks; federated and NLP domains), add statistical testing and diagnostics, and include a detailed runtime/complexity study. We also provide ablations, sensitivity analyses, cost-effectiveness measurements, and failure-mode discussion to clarify where GTMU helps and where it may struggle.
5.5.1 Results on Extended Threat Model
We assume a standard backdoor threat model where an adversary injects a small fraction of poisoned samples and selects a target label. For adaptive attacks, the adversary observes the defense and iteratively modifies triggers or gradients (e.g., PGD-style trigger optimization) to maintain attack success. Our evaluation protocol alternates (i) defense response via GTMU and (ii) adversary adaptation for
Table 7 summarizes method-wise results on three vision benchmarks. GTMU attains the lowest attack success rate (ASR) while maintaining competitive clean accuracy (CA), and ranks best on both ASR and CA aggregates. Importantly, its gains persist even on the larger ImageNet-Subset. Table 8 isolates three adaptive attacks and compares GTMU with the strongest baseline (retrain). GTMU reduces ASR by roughly


5.5.2 Sensitivity to Candidate Budget and HVP Iterations
We vary the unlearning batch size

Beyond centralized vision benchmarks, we test (i) federated learning (FL) with

We also inspect CIFAR-10 target classes with highest residual ASR after defense. Table 11 shows GTMU reduces disproportionate vulnerability (e.g., “Truck”, “Cat”) relative to baselines, but long-tailed classes remain slightly more attack-prone. This motivates adaptive budgeting (larger

5.5.3 Statistical Significance and Effect Sizes
We perform paired

We monitor LiSSA/HVP residuals and inverse-Hessian approximation error. Table 13 reports the fraction of runs meeting tolerance and mean relative error; most runs converge reliably, with slightly higher error on ImageNet-Subset due to depth/scale.
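A residual check of this kind can be sketched as follows. This is a simplified, deterministic stand-in under stated assumptions: it uses the Neumann-series recursion that underlies LiSSA on a small explicit matrix, whereas LiSSA itself uses stochastic mini-batch Hessian-vector products. The names `lissa_solve`, `scale`, and `tol` are illustrative and are not taken from the paper.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))


def lissa_solve(hvp, v, scale, tol=1e-6, max_iters=1000):
    """Approximate H^{-1} v with the recursion h <- v + h - hvp(h) / scale.

    Returns (x, relative_residual, iterations, converged), where x = h / scale
    and relative_residual = ||H x - v|| / ||v||. The recursion converges when
    the eigenvalues of H / scale lie in (0, 2).
    """
    h = list(v)
    v_norm = norm(v)
    x, rel_res = list(v), float("inf")
    for it in range(1, max_iters + 1):
        Hh = hvp(h)
        h = [vi + hi - Hhi / scale for vi, hi, Hhi in zip(v, h, Hh)]
        x = [hi / scale for hi in h]
        # The residual costs one extra HVP; a real run might check it only
        # every few iterations.
        rel_res = norm([a - b for a, b in zip(hvp(x), v)]) / v_norm
        if rel_res <= tol:
            return x, rel_res, it, True
    return x, rel_res, max_iters, False


if __name__ == "__main__":
    # Toy positive-definite "Hessian" for illustration only.
    H = [[2.0, 0.5], [0.5, 1.0]]
    runs = [lissa_solve(lambda u: matvec(H, u), v, scale=3.0)
            for v in ([1.0, 1.0], [1.0, -2.0], [0.5, 0.0])]
    frac_ok = sum(converged for *_, converged in runs) / len(runs)
    print("fraction of runs meeting tolerance:", frac_ok)
```

Logging the residual per run yields the fraction of runs that meet tolerance. When an exact solve is available for comparison, as on small problems, the relative error can be tracked in the same loop.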

We compare wall-clock time, GPU-hours (GH, 4
We compute “hours per 1% ASR reduction” relative to the poisoned baseline (lower is better). GTMU is


We introduced GTMU, a game-theoretic method for machine unlearning that targets backdoor attacks in trained models. We model a defender and an adaptive attacker as a Stackelberg game so the defender can plan ahead and remove the most harmful poisoned samples. GTMU mixes influence-based scoring to find high-impact samples with a light regret loop to refine choices over several steps. On CIFAR-10, GTSRB, and ImageNet-Subset, under attacks such as BadNets, Blended Injection, and TrojanNN, GTMU lowers backdoor success to below
We note a few scope considerations and opportunities for further refinement. While our experiments focus on vision, where triggers are spatial and continuous, extending to text, speech, graphs, or tables introduces domain-specific factors (e.g., discrete tokens and task constraints on allowable edits) that we view as natural next steps. For very large models, Hessian–vector products and influence estimates can be computationally demanding and occasionally noisy, and memory limits together with challenging curvature may call for additional engineering—beyond our current damping, preconditioning, and trust-region safeguards—to fully unlock scalability. Finally, because repeated unlearning updates adjust parameters iteratively, small numerical drift in weights or logits can accumulate; lightweight recalibration (e.g., periodic anchoring to a clean checkpoint or class-wise calibration) appears promising for maintaining accuracy and mitigating any emergent bias over time.
Looking ahead, we plan to carry GTMU beyond vision. For NLP and speech, we will use token- or sequence-level influence and trigger sets made for each domain; for graphs and tables, we will adapt scoring to nodes/edges or fields/rows and respect domain rules. For foundation models, we will pursue parameter-efficient updates (e.g., LoRA/adapters), light preconditioners, curvature clipping, mixed-precision, and distributed HVPs, and we will study unlearning in RAG and multi-modal settings where triggers can come from retrieved text or images. To control drift, we will adopt an unlearn-then-recalibrate routine: trust-region steps with damping, periodic re-anchor to a clean checkpoint (EMA or a short clean fine-tune), class-wise calibration, clear early-stop rules when drift passes a threshold, and simple logs of residual ratio, step norm, and expected ASR drop to decide when to fall back to a first-order step. We will also test stronger adaptive attacks, federated settings with client dropout, and safety tasks in text, and we will work to widen the valid region of our local surrogate, tighten selection bounds, and link our regret dynamics to limits on average attack success.
In short, GTMU is a practical defense with clear gains in vision today and a realistic path to broader use. Addressing the limits above will help make it reliable for language, multi-modal, and very large models while keeping drift under control during repeated unlearning.
Acknowledgement: Not applicable.
Funding Statement: The authors received no specific funding for this study.
Author Contributions: Xiaolei Ding contributed to the conceptual design, theoretical analysis, and drafting of the manuscript. Wenjian Liu supervised the project, guided the methodology, and refined the final manuscript. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: All datasets used in this study (e.g., CIFAR-10, GTSRB, and ImageNet-Subset) are publicly available from their original sources.
Ethics Approval: This study used publicly available benchmark datasets and did not involve human participants, animals, or personally identifiable information.
Conflicts of Interest: The authors declare no conflicts of interest.
1 We note that GTMU’s distribution over unlearning candidates does not perform substantially worse, in hindsight, than the best fixed mixed strategy, up to regret and second-order inexactness.
References
1. Sarabdeen J, Mohamed Ishak MM. A comparative analysis: health data protection laws in Malaysia, Saudi Arabia and EU General Data Protection Regulation (GDPR). Int J Law Manag. 2025;67(1):99–119. doi:10.1108/ijlma-01-2024-0025. [Google Scholar] [CrossRef]
2. Gu T, Dolan-Gavitt B, Garg S. Badnets: identifying vulnerabilities in the machine learning model supply chain. In: Proceedings of Machine Learning and Computer Security Workshop; 2017 Dec 8; Long Beach, CA, USA. [Google Scholar]
3. Zhang S, Pan Y, Liu Q, Yan Z, Choo KKR, Wang G. Backdoor attacks and defenses targeting multi-domain ai models: a comprehensive review. ACM Comput Surv. 2024;57(4):1–35. doi:10.1145/3704725. [Google Scholar] [CrossRef]
4. Chen X, Liu C, Li B, Lu K, Song D. Targeted backdoor attacks on deep learning systems using data poisoning. In: Proceedings of the 10th Workshop on Artificial Intelligence and Security; 2017 Nov 3; Dallas, TX, USA. p. 27–35. [Google Scholar]
5. Bai Y, Xing G, Wu H, Rao Z, Ma C, Wang S, et al. Backdoor attack and defense on deep learning: a survey. IEEE Trans Comput Soc Syst. 2025;12(1):404–34. [Google Scholar]
6. Li Y, Li T, Wang B. Invisible backdoor attacks on deep neural networks via steganography and deep image prior. IEEE Trans Dependable Secure Comput. 2021. doi:10.1109/tdsc.2020.3021407. [Google Scholar] [PubMed] [CrossRef]
7. Turner A, Tsipras D, Madry A, Schmidt L. Label-consistent backdoor attacks. arXiv:1912.02771. 2019. [Google Scholar]
8. Nguyen A, Tran A, Tran L. Input-aware dynamic backdoor attack. In: Advances in neural information processing systems. Cambridge, MA, USA: MIT press; 2020. [Google Scholar]
9. Tran B, Li J, Madry A. Spectral signatures in backdoor attacks. In: Advances in neural information processing systems. Cambridge, MA, USA: MIT press; 2018. p. 8000–10. [Google Scholar]
10. Chen B, Carvalho W, Baracaldo N, Ludwig H, Edwards B, Lee T, et al. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv:1811.03728. 2018. [Google Scholar]
11. Wang B, Yao Y, Shan S, Li H, Viswanath B, Zheng H, et al. Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In: Proceedings of the 2019 IEEE Symposium on Security and Privacy; 2019 May 19–23; San Francisco, CA, USA. p. 707–23. [Google Scholar]
12. Pan Z, Ying Z, Wang Y, Zhang C, Li C, Zhu L. One-shot backdoor removal for federated learning. IEEE Internet Things J. 2024;11(23):37718–30. doi:10.1109/jiot.2024.3438150. [Google Scholar] [PubMed] [CrossRef]
13. Zhao S, Tuan LA, Fu J, Wen J, Luo W. Exploring clean label backdoor attacks and defense in language models. IEEE/ACM Trans Audio Speech Lang Process. 2024;32(1):3014–24. doi:10.1109/taslp.2024.3407571. [Google Scholar] [PubMed] [CrossRef]
14. Liu K, Dolan-Gavitt B, Garg S. Fine-pruning: defending against backdooring attacks on deep neural networks. In: Research in attacks, intrusions, and defenses. Cham, Switzerland: Springer; 2018. p. 273–94. doi:10.1007/978-3-030-00470-5_13. [Google Scholar] [CrossRef]
15. Yao Y, Li H, Zheng H, Zhao BY. Latent backdoor attacks on deep neural networks. In: Proceedings of the ACM SIGSAC Conference on Computer and Communications Security; 2019 Nov 11–15; London, UK. p. 2041–55. [Google Scholar]
16. Li Y, Zhai YJ, Wu Y, Jiang Y. Neural attention distillation: erasing backdoor triggers from deep neural networks. arXiv:2101.05930. 2021. [Google Scholar]
17. Liu K, Dolan-Gavitt B, Garg S. Neural trojans. In: Proceedings of the IEEE International Symposium on Hardware Oriented Security and Trust (HOST); 2017 May 1–5; Mclean, VA, USA. p. 1–8. [Google Scholar]
18. Rahman G, Saeed-Uz-Zaman, Li B, Muzamal JH. Hybridized shield: a framework for backdoor detection in secure federated learning systems. In: Proceedings of the 2024 IEEE 7th International Conference on Big Data and Artificial Intelligence (BDAI); 2024 Jul 5–7; Beijing, China. p. 199–204. [Google Scholar]
19. Staunton C, Shabani M, Mascalzoni D, Mežinska S, Slokenberga S. Ethical and social reflections on the proposed European Health Data Space. Eur J Hum Genetics. 2024;32(5):498–505. doi:10.1038/s41431-024-01543-9. [Google Scholar] [PubMed] [CrossRef]
20. Pan Z, Ying Z, Wang Y, Zhang C, Zhang W, Zhou W, et al. Feature-based machine unlearning for vertical federated learning in IoT NEtworks. IEEE Trans Mobile Comput. 2025;24(6):5031–44. doi:10.1109/tmc.2025.3530529. [Google Scholar] [PubMed] [CrossRef]
21. Malle B, Kieseberg P, Weippl E, Holzinger A. The right to be forgotten: towards machine learning on perturbed knowledge bases. In: International Conference on Availability, Reliability, and Security. Cham, Switzerland: Springer; 2016. p. 251–66. [Google Scholar]
22. Bourtoule L, Chandrasekaran V, Choquette-Choo C, Jia H, Travers A, Zhang B, et al. Machine unlearning. In: Proceedings of the 2021 IEEE Symposium on Security and Privacy; 2021 May 24–7; San Francisco, CA, USA. p. 141–59. [Google Scholar]
23. Liu S, Yao Y, Jia J, Casper S, Baracaldo N, Hase P, et al. Rethinking machine unlearning for large language models. Nat Mach Intell. 2025;7(2):181–94. doi:10.1038/s42256-025-00985-0. [Google Scholar] [CrossRef]
24. Cao Y, Yang J. Towards making systems forget with machine unlearning. In: Proceedings of the 2015 IEEE Symposium on Security and Privacy; 2015 May 17–21; San Jose, CA, USA. p. 463–80. [Google Scholar]
25. Huang MH, Foo LG, Liu J. Learning to unlearn for robust machine unlearning. In: European Conference on Computer Vision. Cham, Switzerland: Springer; 2024. p. 202–19. [Google Scholar]
26. Koh PW, Liang P. Understanding black-box predictions via influence functions. In: International Conference on Machine Learning. Westminster, UK: PLMR; 2017. p. 1885–94. [Google Scholar]
27. Guo C, Goldstein T, Hannun A, van der Maaten L. Fast machine unlearning. arXiv:1912.03817. 2020. [Google Scholar]
28. Golatkar A, Achille A, Soatto S. Eternal sunshine of the spotless net: selective forgetting in deep networks. In: Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition; 2020 Jun 13–19; Seattle, WA, USA. p. 9304–12. [Google Scholar]
29. Nguyen Q, Chen X, Low B, Xu H. Variational bayesian unlearning. In: Advances in neural information processing systems. Cambridge, MA, USA: MIT press; 2020. [Google Scholar]
30. Wu X, Liu Z, Wu J, Wang H, Wang X. Federated unlearning. In: Proceedings of the International Conference on Database Systems for Advanced Applications; 2022 Apr 11–14; Online. p. 19–34. [Google Scholar]
31. Zhu L, Liang Y, Yu W, Chen K. Federated unlearning with knowledge distillation. arXiv:2108.09491. 2021. [Google Scholar]
32. Neel S, Roth A, Sharifi-Malvajerdi S. Descent-to-delete: gradient-based methods for machine unlearning. In: Algorithmic learning theory. Berlin/Heidelberg, Germany: Springer; 2019. p. 931–62. [Google Scholar]
33. Zhou Y, Kantarcioglu M, Xi B. A survey of game theoretic approach for adversarial machine learning. Wiley Interdiscip Rev Data Mining Knowl Discov. 2019;9(3):e1259. doi:10.1002/widm.1259. [Google Scholar] [CrossRef]
34. Pérolat J, Leibo JZ, Zambaldi V, Beattie C, Tuyls K, Graepel T. Actor-critic fictitious play in games with continuous action spaces. In: Proceedings of the International Conference on Artificial Intelligence and Statistics; 2018 Apr 9–11; Playa Blanca, Lanzarote. p. 919–28. [Google Scholar]
35. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. In: Proceedings of the International Conference on Learning Representations; 2018 Apr 30–May 3; Vancouver, BC, Canada. [Google Scholar]
36. Luh R, Eresheim S, Tavolato P, Petelin T, Gmeiner S, Holzinger A, et al. Gamifying information security: adversarial risk exploration for IT/OT infrastructures.Comput Secur. 2025;151:104287. [Google Scholar]
37. Lowd D, Meek C. Adversarial learning. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2005 Aug 21–24; Chicago, IL, USA. p. 641–7. [Google Scholar]
38. Jia L, Qi N, Su Z, Chu F, Fang S, Wong KK, et al. Game theory and reinforcement learning for anti-jamming defense in wireless communications: current research, challenges, and solutions. IEEE Commun Surv Tutor. 2025;27(3):1798–838. [Google Scholar]
39. Wang B, Yao Y, Shan S, Li H, Viswanath B, Zheng H, et al. Certifying robustness against backdoor attacks via randomized smoothing. In: Advances in neural information processing systems. Cambridge, MA, USA: MIT Press; 2020. [Google Scholar]
40. Wellman MP, Tuyls K, Greenwald A. Empirical game theoretic analysis: a survey. J Artif Intell Res. 2025;82:1017–76. doi:10.1613/jair.1.16146. [Google Scholar] [CrossRef]
41. Kang D, Bhagoji AN, Steinhardt J, Song D. Game-theoretic modeling of multi-agent security in machine learning systems. arXiv:2003.12996. 2020. [Google Scholar]
42. Jain G, Kumar A, Bhat SA. Recent developments of game theory and reinforcement learning approaches: a systematic review. IEEE Access. 2024;12(3):9999–10011. doi:10.1109/access.2024.3352749. [Google Scholar] [PubMed] [CrossRef]
43. Arora S, Hazan E, Kale S. The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 2012;8:121–64. [Google Scholar]
44. Cesa-Bianchi N, Lugosi G. Prediction, learning, and games. Cambridge, UK: Cambridge University Press; 2006. [Google Scholar]
45. Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Toronto, ON, USA: University of Toronto; Technical Report. 2009 [Online]. [cited 2025 Aug 26]. Available from: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. [Google Scholar]
46. Stallkamp J, Schlipsing M, Salmen J, Igel C. Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. In: Neural networks. Amsterdam, The Netherlands: Elsevier; 2012. Vol. 32, p. 323–32. doi:10.1016/j.neunet.2012.02.016. [Google Scholar] [CrossRef]
47. Deng J, Dong W, Socher R, Li LJ, Li K, Li F. ImageNet: a large-scale hierarchical image database. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009 Jun 20–25; Miami, FL, USA. p. 248–55. [Google Scholar]
48. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas, NV, USA. p. 770–8. [Google Scholar]
49. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. 2014. [Google Scholar]
50. Liu Y, Ma S, Aafer Y, Lee WC, Zhai J, Wang W, et al. Trojaning attack on neural networks. In: Proceedings of the Network and Distributed Systems Security Symposium (NDSS 2018); 2018 Feb 18–21; San Diego, CA, USA. p. 1–15. [Google Scholar]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

