AugTrans: Boosting Adversarial Transferability in Object Detection with a Dynamic, Object-Aware Augmentation Pipeline

Sudhir Pandey; Jian-Xun Mi; Zahid Ullah; Mona Jamjoom

doi:10.32604/cmc.2026.074811

Open Access

ARTICLE

Sudhir Kumar Pandey¹, Jian-Xun Mi¹,*, Zahid Ullah², Mona Jamjoom³

1 School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China
2 Information Systems Department, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
3 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia

* Corresponding Author: Jian-Xun Mi.

Computers, Materials & Continua 2026, 87(3), 98 https://doi.org/10.32604/cmc.2026.074811

Received 18 October 2025; Accepted 28 February 2026; Issue published 09 April 2026

Abstract

Adversarial examples in object detection frequently fail to transfer between different models because attacks overfit to the source model’s architecture and feature space. We propose AugTrans, a framework that addresses this limitation through input-space regularization. Our key innovation is a multi-stage augmentation pipeline that incorporates object-level semantic awareness into transformation design. The pipeline comprises three novel components: dynamic object-centric rotation with adaptive scheduling, multi-box aware resizing based on ground-truth annotations, and composite noise injection. These transformations are integrated within the Expectation over Transformation (EOT) framework. By optimizing perturbations to remain effective across semantically meaningful transformations, our method forces attacks to target vulnerabilities shared across architectures. Experiments on MS COCO demonstrate that our method reduces YOLOv5s mean Average Precision (AP) from 32.6% to 2.06%, substantially outperforming prior general-purpose transfer methods on one-stage detectors. All AP values denote COCO-style mean Average Precision (mAP@[0.5:0.95]) unless noted. Importantly, our method maintains effectiveness when using predicted bounding boxes (1.93% AP), eliminating the ground-truth dependency for practical black-box scenarios. Our approach also demonstrates competitive transferability to transformer-based detectors (DETR-R50 AP: 2.8%, DINO-R50 AP: 5.4%), although specialized transformer-specific methods achieve superior performance when the target architecture is known. These results establish that semantically aware augmentation constitutes an effective strategy for generating transferable attacks. We discuss both the security implications and potential defensive applications of our findings.

Keywords

Object detection; adversarial examples; transferability; data augmentation; black-box attack; non-robust features; transformer-based detectors

1 Introduction

Convolutional Neural Networks (CNNs) achieve state-of-the-art results across computer vision tasks, including image classification, segmentation, and object detection [1–4]. CNN-based object detectors now form the backbone of applications ranging from autonomous driving to medical imaging. Related adversarial vulnerabilities have also been demonstrated in multimodal, sensor-based machine learning systems, highlighting broader cyber-physical risks beyond vision-only pipelines [5]. However, these models remain susceptible to adversarial examples—carefully crafted inputs with imperceptible perturbations that induce incorrect outputs [6,7]. Adversarial examples exhibit an important property called transferability: an attack crafted on one model can successfully transfer to other unknown models. Generating highly transferable attacks against black-box object detectors carries significant implications for AI security [8,9].

Object detectors present a more challenging attack surface than image classifiers due to their multi-task nature, which requires simultaneous disruption of classification and localization tasks across multiple instances per image. Modern detectors exhibit significant architectural diversity, encompassing two-stage models such as Faster R-CNN [10], one-stage models such as SSD [11] and YOLO [12], and transformer-based architectures such as DETR [13] and DINO [14]. A transferable attack must prove effective against fundamentally different processing pipelines: an attack successful against a Region Proposal Network (RPN) may fail against anchor-free models or transformer-based detectors that rely on global self-attention. This architectural heterogeneity poses a fundamental challenge—adversarial perturbations tend to overfit to source model characteristics.

Adversarial perturbations optimized against a single source model tend to exploit model-specific rather than fundamental vulnerabilities. When applied to target models, these “overfitted” perturbations often fail because they do not manipulate non-robust features common across architectures, as demonstrated by Ilyas et al. [15]. True transferability requires perturbations that target these shared features through appropriate regularization during optimization.

Existing transferability enhancement methods for object detection apply task-agnostic transformations without considering semantic scene structure. Current EOT-based approaches [9] employ generic augmentations (random resizing, padding, rotation) that treat all images uniformly regardless of object locations, sizes, or scene composition. Random resizing, for instance, applies identical scale factors whether objects are large or small, creating unrealistic distortions that enable shortcut learning rather than forcing attacks to target robust shared features. This content-independent approach misses opportunities to regularize perturbations toward semantically plausible transformations that better capture real-world variations across architectures.

We address this gap through semantically aware augmentation within the Expectation over Transformation (EOT) framework [16]. Our key innovation involves incorporating object-level information directly into transformation design. Rather than applying uniform random scaling, we introduce content-adaptive transformations where augmentation intensity is proportional to actual scene characteristics: images with large objects receive aggressive scale variations, while images with small objects receive conservative variations, ensuring semantic plausibility. Our object-centric rotation considers ground-truth object positions when selecting rotation centers, stochastically choosing among the image center, random points, or actual object centers. By optimizing perturbations to remain effective across these content-aware transformations, the attack targets vulnerabilities shared across architectures rather than model-specific artifacts. This approach adapts EOT from physical-world robustness to achieve superior digital transferability.

It is important to emphasize that our contribution extends fundamentally beyond simply adding bounding boxes to existing EOT frameworks. While prior methods [9,16] employ content-agnostic transformations with fixed distributions regardless of image content, we introduce content-adaptive transformations where augmentation parameters are computed as functions of actual scene characteristics. This principled design, grounded in the non-robust features hypothesis [15], yields an improvement over prior methods such as T-SEA [9] (YOLOv5s AP: 2.06% vs. 8.6%) and enables competitive cross-paradigm transferability from CNNs to transformer-based detectors.

1.1 Contributions

Our contributions are as follows:

• Object-Aware Augmentation Pipeline: We present a novel framework to incorporate ground-truth bounding box information directly into transformation parameter computation, creating semantically plausible variations for object detection tasks.

• Content-Adaptive Scaling: We propose a proportional transformation mechanism that adjusts augmentation intensity based on scene content, implicitly addressing multi-scale challenges and forcing attacks to target scale-invariant features shared across architectures.

• Synergistic Integration: We systematically combine complementary transformations (content-adaptive resizing, object-centric rotation, composite noise, reflection padding) within EOT, with ablation studies validating their synergistic contributions.

• Cross-Paradigm Transferability: We demonstrate architecture-agnostic transfer from CNN-based source models to both conventional and transformer-based detectors (YOLOv5s AP: 2.06%, DETR-R50 AP: 2.8%, DINO-R50 AP: 5.4% on MS COCO), achieving effective generalization without target-specific tuning across fundamentally different processing paradigms.

• Practical Black-Box Applicability: We demonstrate effectiveness using predicted bounding boxes from the source detector (YOLOv5s AP: 1.93% on MS COCO), eliminating ground-truth dependency and validating real-world applicability without dataset annotations.

• Comprehensive Evaluation: We provide extensive experiments on PASCAL VOC and MS COCO, including quantitative comparisons, per-category analysis, feature visualizations, and rigorous ablation studies across multiple detector families.

2 Related Work

Adversarial attacks on object detectors present unique challenges compared to image classification due to the multi-task nature requiring simultaneous disruption of localization and classification [17–19]. We structure existing work into key categories that contextualize our contribution.

2.1 Theoretical Foundations of Transferability

Understanding why adversarial perturbations transfer from one model to another is fundamental to designing effective attacks. Ilyas et al. [15] demonstrated that adversarial examples originate from models learning highly predictive but brittle features inherent to the data distribution—termed “non-robust features.” These features prove useful for standard accuracy but remain vulnerable to small perturbations. Transferability occurs when perturbations exploit such features that are common across various model architectures [20,21].

Subsequent work has examined the geometry of adversarial perturbations [22] and demonstrated that transferable attacks tend to occupy flatter loss basins, which represent generalization across models. Researchers have also analyzed neural network decision boundaries, demonstrating that adversarial examples transfer when they push inputs across shared decision boundaries. These theoretical insights provide a foundation: true transferability requires perturbations that exploit fundamental shared vulnerabilities rather than model-specific artifacts. The challenge lies in designing optimization procedures that enforce this property, which motivates our regularization-based approach.

2.2 Theoretical Foundation for Semantic-Aware Augmentation

We establish a theoretical framework grounded in the non-robust features hypothesis [15] explaining how semantic awareness enhances transferability.

Definition 1 (Semantic Plausibility): A transformation 𝒯:𝒳→𝒳 is semantically plausible for image x with annotations ℬ={b1,…,bn} if p(𝒯(x)|ℬ)>τ, where p(⋅|ℬ) represents the conditional distribution of natural images given object configurations.

Intuitive Interpretation: The conditional probability p(𝒯(x)|ℬ)>τ measures whether a transformed image 𝒯(x) remains statistically consistent with natural images containing similar object configurations ℬ. Consider a concrete example: if an image contains small distant objects (e.g., cars 50 pixels wide at 100 meters), applying aggressive 2× scaling would produce unrealistic 100-pixel objects at the same perceived distance—such distortions yield low p(𝒯(x)|ℬ) and violate the threshold τ. Conversely, moderate scaling proportional to object size (e.g., 1.2× for large objects, 1.05× for small objects) maintains p(𝒯(x)|ℬ)>τ, keeping transformations within the natural image manifold—the subspace of statistically plausible images observed in real-world scenes.

Contrasting Transformation Distributions: The distinction between Φsemantic and Φgeneric is critical for understanding our contribution. Content-agnostic methods sample transformations from fixed, image-independent distributions (e.g., uniform random scaling s∼𝒰(0.8,1.2) applied identically to all images), which generate semantically implausible variations—such as distorting small objects beyond realistic aspect ratios or creating scale mismatches inconsistent with scene depth. In contrast, Φsemantic conditions transformation parameters on actual scene content: scaling intensity adapts to object size (s=1+robj⋅ρ where robj reflects true object scale, Section 3.3.3), and rotation centers align with object positions rather than being uniformly random. This conditional sampling ensures all transformations remain within the natural manifold, forcing perturbations to exploit fundamental architectural vulnerabilities rather than artifacts of unrealistic distortions.
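The contrast between the two sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the excerpt gives s=1+robj⋅ρ but does not spell out how robj or ρ are computed, so the helper `normalized_object_size` (mean square root of the area fraction of the largest boxes), the parameter `rho_max`, and the default `k_obj=3` are hypothetical choices, not the authors' definitions.

```python
import random

def normalized_object_size(boxes, img_w, img_h, k_obj=3):
    """Assumed r_obj: mean sqrt(area fraction) of the k_obj largest boxes, in [0, 1]."""
    if not boxes:
        return 0.0
    fractions = []
    for x1, y1, x2, y2 in boxes:
        area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        fractions.append(min(1.0, area / (img_w * img_h)))
    largest = sorted(fractions, reverse=True)[:k_obj]
    return sum(f ** 0.5 for f in largest) / len(largest)

def generic_scale(rng, low=0.8, high=1.2):
    """Content-agnostic baseline: the same U(low, high) range for every image."""
    return rng.uniform(low, high)

def semantic_scale(boxes, img_w, img_h, rng, rho_max=0.2, k_obj=3):
    """Content-adaptive scale s = 1 + r_obj * rho, with rho ~ U(-rho_max, rho_max) (assumed)."""
    r_obj = normalized_object_size(boxes, img_w, img_h, k_obj)
    rho = rng.uniform(-rho_max, rho_max)
    return 1.0 + r_obj * rho

rng = random.Random(0)
small_scene = [(10, 10, 42, 42)]    # one 32x32 box in a 640x640 image
large_scene = [(0, 0, 512, 512)]    # one 512x512 box in a 640x640 image
print(semantic_scale(small_scene, 640, 640, rng))
print(semantic_scale(large_scene, 640, 640, rng))
print(generic_scale(rng))
```

Under these assumptions the small-object scene has robj of about 0.05, so its scale factor stays within roughly 1 ± 0.01, while the large-object scene has robj of 0.8 and can vary within 1 ± 0.16. The generic sampler uses the full 0.8 to 1.2 range for both.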

Justification: Content-agnostic transformations enable shortcuts: when 𝒯 ignores semantics, 𝒯(x) may leave the natural manifold, allowing perturbations to exploit transformation artifacts. Semantic constraints force targeting of features that are: (i) scale-invariant via s∝robj, (ii) viewpoint-invariant through object-centric rotation, and (iii) architecture-agnostic by construction.

Information-Theoretic View: Semantic-aware augmentation maximizes I(δ;ℱshared) while minimizing I(δ;ℱmodel), ensuring Φsemantic covers architecture-common variations.

Practical Implementation Note: Our content-adaptive transformations (box-aware resize, object-centric rotation) intrinsically satisfy Definition 1 without requiring explicit computation of p(𝒯(x)|ℬ). The threshold τ serves as a theoretical construct ensuring transformations preserve statistical realism.

2.3 Object Detection Architectures and Attack Surfaces

Object detectors fall into three general architecture families, each with distinct attack surfaces requiring different vulnerability exploitation strategies.

Two-stage detectors such as Faster R-CNN [10] and Cascade R-CNN [23] operate in sequential stages using a Region Proposal Network (RPN) and detection head. Attacks can target either stage, preventing proposal generation [7,24] or causing misclassification in the detection head [25,26]. This sequential architecture creates multiple vulnerability points across the pipeline stages—region information, intermediate results, or final classifications.

One-stage detectors including YOLO [3,12], SSD [11], RetinaNet [27], and FCOS [28] directly predict dense feature maps to reduce inference time. These models are traditionally attacked by maximizing the loss over spatial locations [9,29,30]. Their dense prediction approach renders them both efficient and potentially more vulnerable to spatially distributed perturbations.

Transformer-based detectors, pioneered by DETR [13] and advanced by Deformable DETR [31], DINO [14], and Conditional DETR [32], represent a paradigm shift. By treating detection as set prediction through global self-attention, these models eliminate hand-designed components such as anchors and NMS. Unlike CNNs that process images through hierarchical local features, transformers employ global self-attention mechanisms from early layers, fundamentally altering the vulnerability landscape. The architectural heterogeneity of these three families creates the primary challenge for transferable attacks: perturbations must prove effective across fundamentally different processing pipelines, from local convolution-based feature extraction to global attention mechanisms.

2.4 Ensemble and Single-Source Transfer Methods

A natural approach to enhancing transferability involves optimizing perturbations across multiple source models simultaneously. Tramèr et al. [8] introduced ensemble adversarial training, demonstrating that attacks effective against multiple models transfer more readily to unseen models. This occurs because the perturbation must exploit vulnerabilities common to all ensemble members, thereby eliminating model-specific artifacts. Recent work extends this principle to object detection [33–35], demonstrating improved black-box performance through attack generation across various detector ensembles. Liu et al. [36] showed that ensemble diversity—measured by architecture, training procedure, and objective function differences—correlates positively with attack transferability. Dong et al. [37] introduced momentum-based iterative attacks coupled with ensemble optimization, achieving state-of-the-art transfer performance.

However, ensemble methods suffer from a critical limitation: computational cost scales linearly with ensemble size. Each optimization iteration involves a forward-backward pass through all source models, rendering the approach prohibitively expensive for large ensembles or resource-constrained settings. This limitation has motivated exploration of single-source alternatives that achieve comparable transferability through other regularization mechanisms, such as input-space transformations and gradient variance reduction schemes [38]. Our work operates in this single-source paradigm, demonstrating that semantically aware augmentation can approach ensemble-level transferability without the computational overhead.

2.5 Input Transformation Methods

Input transformation has emerged as an efficient alternative to ensemble methods for enhancing transferability. Xie et al. [39] proposed Diverse Input (DI-FGSM), applying random resizing and padding during optimization to improve transferability for image classification. This work adapted the Expectation over Transformation (EOT) framework [16], originally developed for physical-world robustness, to enhance digital transferability. The key insight is that optimizing attacks to remain effective over a distribution of transformed inputs forces perturbations to target more fundamental, transformation-invariant features.

Transformation diversity has been extended along multiple dimensions. Dong et al. [37,40] proposed translation-invariant attacks using momentum and diverse inputs, demonstrating that spatial shift invariance improves black-box performance. Lin et al. [41] applied Nesterov accelerated gradient for improved optimization dynamics in the presence of transformations. Wang and He [42] introduced scale-invariant attacks through multi-scale feature aggregation, demonstrating that scale invariance proves particularly important for object detection where objects appear at various sizes.

For object detection specifically, T-SEA [9] combined self-ensemble augmentation with ShakeDrop regularization, demonstrating the value of transformation-based regularization in detection tasks.

Critical Differentiation from DI-FGSM and T-SEA: While our approach builds upon the EOT framework shared with DI-FGSM [39] and T-SEA [9], our fundamental design philosophy differs substantially from these content-agnostic methods:

Comparison with DI-FGSM: DI-FGSM applies uniform random transformations (resizing, padding) sampled from fixed distributions regardless of image content. For instance, all images receive identical scale factor distributions s∼𝒰(0.875,1.00) and padding ranges, treating a close-up portrait identically to a landscape with distant objects. This task-agnostic approach, originally designed for image classification, ignores the spatial structure and object-level semantics critical to detection tasks.

Comparison with T-SEA: Although T-SEA targets object detection, it retains content-agnostic transformations: random resizing samples scale factors from fixed distributions (e.g., s∼𝒰(0.8,1.2)) regardless of whether images contain large dominant objects or small distant ones. Similarly, random rotation applies identical angle distributions without considering object positions.

Our Innovation: In contrast, AugTrans introduces content-adaptive transformations where augmentation parameters are computed as functions of actual scene characteristics:

(1) Multi-box aware resizing: We compute scale factors as s=1+robj⋅ρ where robj represents the normalized size of the Kobj largest objects. Large-object scenes receive aggressive scaling (robj→1⇒ wider range), while small-object scenes receive conservative variations (robj→0⇒ narrower range), ensuring semantic plausibility.

(2) Object-centric rotation: We stochastically select rotation centers from {𝒞img,𝒞rand,𝒞obj}, incorporating actual object locations into geometric transformations rather than rotating blindly around image centers.

(3) Adaptive scheduling: We employ curriculum-based progression (θ∝k/Kmax) rather than static angle distributions, preventing optimization failure in early iterations.

This content-awareness ensures transformations remain semantically plausible, preserving the statistical regularities that define shared non-robust features across architectures, rather than introducing artificial distortions that enable shortcut learning. Our experiments show that our method achieves substantially lower AP on YOLOv5s (2.06%) than T-SEA (8.6%), which supports the view that semantic awareness is the critical factor enabling superior transferability.
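The stochastic choice of rotation center described in item (2) can be sketched as follows. The selection probabilities are not given in this excerpt, so the uniform default `probs`, the function name `choose_rotation_center`, and the fallback to the image center when no boxes exist are all assumptions made for illustration.

```python
import random

def choose_rotation_center(boxes, img_w, img_h, rng, probs=(1/3, 1/3, 1/3)):
    """Pick a rotation center from {image center, random point, object center}.

    boxes are (x1, y1, x2, y2) tuples in pixels; probs are hypothetical weights
    for the three modes in the order (img, rand, obj).
    """
    mode = rng.choices(["img", "rand", "obj"], weights=probs, k=1)[0]
    if mode == "obj" and boxes:
        x1, y1, x2, y2 = rng.choice(boxes)
        return "obj", ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    if mode == "rand":
        return "rand", (rng.uniform(0, img_w), rng.uniform(0, img_h))
    # Image center; also used as the fallback when no boxes are available.
    return "img", (img_w / 2.0, img_h / 2.0)

rng = random.Random(0)
boxes = [(100, 120, 300, 360), (400, 50, 480, 130)]
for _ in range(3):
    print(choose_rotation_center(boxes, 640, 480, rng))
```

The returned center could then be passed to any rotation routine that accepts an explicit pivot point. The box list may come from ground-truth annotations or from the source detector's own predictions.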

Apart from our approach, other contemporary studies have focused on optimization refinements. Shi et al. [30] improved transferability through better augmentation scheduling and loss formulations. Recent advances (2023–2025) have further extended the boundaries of transferable attacks. Wang et al. [43] proposed DIB-UAP, which leverages deep information bottleneck (DIB) theory to extract and disrupt crucial intermediate features. By identifying features that significantly impact detection through mutual information analysis, DIB-UAP achieves notable cross-architecture transferability. Additionally, they introduce a Scale & Tile augmentation strategy to improve attack effectiveness on medium and large objects. Xue et al. [44] proposed an object feature-wise attention mechanism combined with perturbation extraction using generative adversarial networks, demonstrating improved stealthiness while maintaining attack transferability. Chen et al. [45] improved ensemble transferability through dynamic model diversity weighting, achieving enhanced cross-architecture performance; this result supports exploiting shared structural properties, much as our semantic-aware approach does. Zhang et al. [46] demonstrated that path-augmented methods improve transferability by optimizing perturbations along multiple transformation pathways, achieving an 18% improvement over baseline methods.

However, these methods require ensemble optimization (computationally expensive), rely on architectural knowledge (violating black-box constraints), or focus on frequency-domain manipulations without semantic awareness. DIB-UAP, for example, relies on statistical feature selection, and Xue et al. require complex generative models; neither may generalize optimally across diverse detector architectures. Our approach differs fundamentally by incorporating ground-truth object information directly into transformation design within a single-source paradigm, forcing attacks to target semantically meaningful invariances inherently shared across architectures, including the emerging CNN-to-transformer transfer challenge that these recent works largely overlook.

Despite these advances, existing methods employ generic augmentations (random resize, padding, rotation) without regard to task-specific semantics. A significant gap therefore remains: for object detection, transformations that ignore object locations, perspectives, scene composition, and object sizes may cause unrealistic distortions that allow the optimization to find shortcuts. Random resizing, for example, applies identical scale factors regardless of whether images contain large dominant objects or small distant ones. This content-agnostic approach fails to leverage opportunities to regularize the attack toward semantically realistic transformations that better capture real-world variations across architectures. Our work addresses this gap through content-aware augmentation design.

2.6 The Augmentation-Robustness Connection

Data augmentation has emerged as a cornerstone of adversarial robustness during model training [47–49]. The intuition is straightforward: if models learn features invariant to augmentations, they become more robust to both natural corruptions and adversarial perturbations. Methods such as AugMix [50], AutoAugment [51], and RandAugment [48,52] demonstrate that diverse augmentation strategies encourage learning of robust, generalizable features. Xie et al. [53] demonstrated that adversarial examples can improve the performance of standard models when used as deep augmentation, implying deep connections between adversarial robustness and augmentation-invariant features.

We propose an inversion of this reasoning: if augmentation-invariant features constitute robust features during training, then augmentation-invariant non-robust features should be shared across architectures during attack optimization. By tuning perturbations to remain effective across semantically rich transformations, we force the attack to exploit fundamental vulnerabilities shared across models rather than model-specific artifacts. This insight connects robustness research with transferability. Prior studies suggest that EOT can be adapted for digital transferability when augmentations are designed with task-specific awareness. However, an important question remains: should attack augmentations mirror those used for defense training, or should they be application-specific? For object detection, the semantics of object scale, location, and spatial arrangement are absent from generic image augmentations, motivating our object-aware augmentation design.

2.7 Attacks on Transformer-Based Detectors

Transformer-based detectors introduce unique challenges for cross-architecture transferability. Recent work has developed specialized attacks targeting transformer-specific vulnerabilities. Methods such as BFDA [54] achieve strong performance by perturbing attention maps and query-key interactions characteristic of transformer architectures. Shao et al. [55] and Fu et al. [56] conducted comprehensive analyses demonstrating that vision transformers exhibit distinct adversarial vulnerability patterns compared to CNNs: perturbations exploit self-attention mechanisms and global feature aggregation from early layers through fundamentally different pathways than convolutional feature hierarchies.

However, these specialized approaches require architectural knowledge of the target model and employ different strategies for CNN vs. transformer targets. This limits their applicability in black-box attack scenarios where the target architecture remains unknown. The fundamental architectural difference between CNNs (local and hierarchical) and transformers (global and attention-based) presents a significant challenge: perturbations exploiting CNN-specific inductive biases may fail against attention mechanisms, while transformer-specific attacks may not transfer to convolutional models. This motivates investigation of general augmentation strategies capable of achieving cross-paradigm transferability.

Despite considerable effort, current transfer attack methods exhibit severe shortcomings. Current EOT-based approaches employ task-agnostic transformations without considering object detection semantics—all images receive identical processing regardless of object location, size, or scene composition. Random resizing applies the same scale factors for large and small objects, creating artificial distortions that facilitate shortcut learning rather than forcing robust feature targeting. Static augmentation schedules fail to exploit curriculum-based progression, and content-independent transformations cannot capture semantically plausible variations that drive true cross-architecture transferability.

Our framework addresses these gaps through semantically aware augmentation: (1) multi-box aware resizing couples transformation intensity directly to actual object content, ensuring plausible scale variations proportional to scene characteristics; (2) dynamic object-centric rotation incorporates ground-truth object positions into transformation design, forcing invariance to semantically relevant geometric shifts; (3) adaptive scheduling implements curriculum learning from conservative to aggressive transformations. By elevating augmentation from generic preprocessing to task-aware regularization, we achieve transferability rivaling ensemble methods while maintaining single-source computational efficiency—particularly valuable in true black-box scenarios where the target architecture remains unknown.

Notation

Before detailing our methodology pipeline, we establish the key mathematical notation used throughout this section. Table 1 summarizes the primary symbols, with comprehensive definitions provided in Supplementary Materials Section S7.

[Table 1: Primary notation]

3 Methodology

Our approach generates highly transferable adversarial perturbations through semantically aware input transformations. By optimizing across content-adaptive transformations, we target fundamental vulnerabilities shared across diverse detector architectures.

3.1 Threat Model

Setup: We consider a transfer-based black-box attack where the attacker has white-box access to a source detector 𝒟source (full architecture, parameters, and gradients) but only black-box query access to victim detectors 𝒟victim (unknown architecture, no gradients).

Goal: Craft imperceptible perturbations δ (‖δ‖∞≤5/255) optimized on 𝒟source that transfer to degrade 𝒟victim without gradient access.

Constraints: No knowledge of victim architecture, defenses, or training procedures; perturbations must remain visually imperceptible.

3.2 Problem Formulation

Let $\mathcal{D}$ denote an object detector that processes an input image $x \in \mathbb{R}^{H \times W \times N_c}$ with ground-truth annotations $y$ to produce detection predictions $\mathcal{D}(x)$, where $N_c$ denotes the number of image channels. In a transfer-based adversarial attack scenario, we aim to craft a perturbation $\delta$ such that the adversarial example $x_{\text{adv}} = x + \delta$ causes an unknown victim detector $\mathcal{D}_{\text{victim}}$ to fail.

The perturbation is generated using white-box access to a source detector $\mathcal{D}_{\text{source}}$, meaning we have access to its architecture, parameters, and gradients throughout the optimization process. To ensure visual imperceptibility, the perturbation is constrained by an $L_\infty$-norm: $\|\delta\|_\infty \le \varepsilon$, where $\varepsilon$ represents a small constant (typically $\varepsilon = 5/255$ in our experiments).
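In an iterative attack, the norm constraint is usually enforced by projecting the perturbation back into the allowed set after each update. The sketch below shows one such projection on a flattened image stored as a plain Python list. The function name `project_linf` and the assumption that pixel values lie in [0, 1] are illustrative and are not taken from the paper.

```python
def project_linf(delta, x, eps=5 / 255):
    """Project delta onto the L-infinity ball of radius eps around x.

    delta and x are flat lists of floats of equal length. Pixel values are
    assumed to lie in [0, 1], so x + delta is also kept inside that range.
    """
    projected = []
    for d, p in zip(delta, x):
        d = max(-eps, min(eps, d))          # enforce |delta_i| <= eps
        d = max(0.0 - p, min(1.0 - p, d))   # keep p + delta_i in [0, 1]
        projected.append(d)
    return projected

x = [0.5, 0.5, 0.995]
delta = [0.5, -0.5, 0.01]
print(project_linf(delta, x))
```

With array libraries the same step is normally a pair of element-wise clipping operations; the loop form is used here only to keep the example dependency-free.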

The primary challenge in transfer attacks is that perturbations tailored to a specific source model often exploit model-specific artifacts rather than fundamental vulnerabilities. When applied to architecturally distinct target models, these overfitted perturbations frequently fail because they do not target features common across multiple architectures.

3.3 Semantically Aware Augmentation Pipeline

To address this challenge, we implement the Expectation over Transformation (EOT) framework with a novel approach. Instead of generic augmentations that ignore image content, we construct a transformation pipeline based on actual information about object locations and sizes. This forces our attack to function across realistic variations, making it robust enough to transfer between different detector architectures. The key idea involves creating a semantically informed transformation distribution Φ, where each sampled transformation combines multiple content-aware operations that produce diverse, realistic variations. Unlike previous work that applies identical transformations regardless of image content, our pipeline adapts to scene properties such as object positions, sizes, and spatial distribution. During attack optimization, a fresh transformation sequence comprising four complementary components is assigned to each EOT sample.
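For reference, the generic EOT objective that this pipeline instantiates can be written as below. This is the standard textbook form rather than the paper's own equation, which does not appear in this excerpt. The detection loss $\mathcal{L}$ and the number of EOT samples $N$ are symbols assumed here for illustration.

```latex
\max_{\|\delta\|_\infty \le \varepsilon}\;
\mathbb{E}_{\mathcal{T} \sim \Phi}
\Big[ \mathcal{L}\big(\mathcal{D}_{\text{source}}(\mathcal{T}(x + \delta)),\, y\big) \Big]
\;\approx\;
\max_{\|\delta\|_\infty \le \varepsilon}\;
\frac{1}{N} \sum_{i=1}^{N}
\mathcal{L}\big(\mathcal{D}_{\text{source}}(\mathcal{T}_i(x + \delta)),\, y\big),
\qquad \mathcal{T}_i \sim \Phi .
```

The expectation is approximated by a Monte Carlo average over freshly sampled transformations, which matches the description of assigning a new transformation sequence to each EOT sample.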

3.3.1 Conceptual Rationale

Before presenting our augmentation components, we first establish the theoretical motivation for semantic awareness in EOT frameworks. Content-agnostic methods [9,39] apply uniform transformations with fixed distributions (e.g., s∼𝒰(0.8,1.2) for all images), creating semantically implausible variations that enable “shortcut” learning. For instance, aggressively scaling small objects produces unrealistic aspect ratios outside the natural image manifold, allowing perturbations to exploit transformation artifacts rather than fundamental vulnerabilities.

Our semantic-aware principle addresses this: by computing transformation parameters as functions of actual scene characteristics—s=1+robj⋅ρ scales proportionally to object size—we constrain optimization to semantically plausible variations. As formalized in Proposition 1 (Section 2.2), this forces perturbations to target features that are both: (a) predictive across architectures, and (b) vulnerable to realistic input variations. These shared non-robust features [15] enable cross-architecture transferability, yielding improvement over content-agnostic T-SEA (YOLOv5s AP: 2.06% vs. 8.6%). Ablation studies confirm that Box-Aware Resize—our most semantically informed component—provides the strongest individual contribution, validating that semantic awareness, not merely augmentation diversity, constitutes the critical mechanism.

3.3.2 Dynamic Object-Centric Rotation

The first component rotates the input image by an angle θ around a center point 𝒞rot. This module introduces two novel mechanisms that distinguish it from conventional rotation-based augmentations.

Dynamic Angle Scheduling.

Rather than sampling rotation angles from a fixed distribution throughout optimization, we employ a curriculum-based progressive schedule. The rotation angle is sampled from an adaptive distribution that expands as optimization progresses:

$$\theta \sim \mathcal{U}(\theta_{\min}, \theta_{\max}) \cdot \left(1 + c \cdot \frac{k}{K_{\max}}\right), \tag{2}$$

where θ denotes the rotation angle (degrees), 𝒰(θmin,θmax) denotes a uniform distribution over the base angle range, k represents the current iteration index, Kmax denotes the total number of iterations, and c is a curriculum scaling factor (typically c=0.5).

Design Rationale. We set c=0.5 to balance two competing objectives: (1) optimization stability in early iterations, where aggressive rotations cause gradient vanishing and hinder effective perturbation learning, and (2) geometric robustness in later iterations, where expanded rotation ranges ensure the final perturbation transfers across substantial geometric transformations. Empirically, c<0.5 converges slowly due to insufficient late-stage diversity, while c>0.5 causes early-stage optimization instability (validation in Supplementary Materials Section S2).

This dynamic scheduling provides an easy-to-hard curriculum during optimization. Early iterations employ conservative rotation angles to prevent optimization failure, while later iterations utilize aggressive rotations to ensure the final perturbation remains effective under substantial geometric transformations.
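The schedule in Eq. (2) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `rotation_bound` and `sample_rotation_angle` are hypothetical, and the defaults (base range of ±8 degrees, c = 0.5) are taken from the hyperparameters reported in the paper.

```python
import random


def rotation_bound(k, k_max, theta_max=8.0, c=0.5):
    """Upper end of the rotation range (degrees) at iteration k, following Eq. (2)."""
    return theta_max * (1.0 + c * k / k_max)


def sample_rotation_angle(k, k_max, theta_min=-8.0, theta_max=8.0, c=0.5, rng=random):
    """Draw a base angle from U(theta_min, theta_max), then widen it by the curriculum factor."""
    base = rng.uniform(theta_min, theta_max)
    return base * (1.0 + c * k / k_max)
```

With these defaults the bound grows linearly from 8 degrees at k = 0 to 12 degrees at k = K_max, so early iterations see only mild rotations.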

Stochastic Center Selection.

The rotation center 𝒞rot is selected stochastically from a diverse set of candidates:

$$\mathcal{C}_{rot} \in \{\mathcal{C}_{img}, \mathcal{C}_{rand}, \mathcal{C}_{obj}\},$$

where 𝒞img denotes the image center, 𝒞rand represents a uniformly sampled random point within the image bounds, and 𝒞obj is the center of a randomly selected ground-truth bounding box.

The inclusion of 𝒞obj represents a key innovation: by occasionally rotating around actual object locations, we force the perturbation to be robust to object-centric viewpoint variations. This proves particularly important for detection tasks where objects may appear at various positions and orientations within scenes.
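A possible way to draw the rotation center is sketched here. The paper does not state the selection probabilities, so choosing uniformly among the available candidates is an assumption, as are the function name `select_rotation_center` and the `(x1, y1, x2, y2)` box format.

```python
import random


def select_rotation_center(height, width, boxes, rng=random):
    """Pick a rotation center (cx, cy) from the image center, a random point, or an object center.

    boxes: list of (x1, y1, x2, y2) tuples; may be empty.
    Assumption: each available candidate type is chosen with equal probability.
    """
    candidates = ["img", "rand"]
    if boxes:
        candidates.append("obj")
    choice = rng.choice(candidates)
    if choice == "img":
        return (width / 2.0, height / 2.0)
    if choice == "rand":
        return (rng.uniform(0.0, width), rng.uniform(0.0, height))
    # Object-centric case: center of one randomly selected bounding box.
    x1, y1, x2, y2 = rng.choice(boxes)
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
```

The selection happens outside the computational graph; only the resulting center coordinates enter the differentiable rotation.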

3.3.3 Multi-Box Aware Resizing

Building upon geometric invariance, we next address scale variation through content-adaptive resizing. This component represents our primary semantic contribution: rather than applying uniform random scaling, we compute scale factors (sh,sw) that are directly coupled to the actual content and object characteristics of each individual image.

Content-Aware Scale Computation.

We first identify the Kobj largest ground-truth objects by area. We set Kobj=3 based on three considerations: (1) computational efficiency in dense scenes (COCO averages 7.7 objects/image), (2) the three largest objects typically account for 68%–75% of total object area, sufficiently characterizing scene scale, and (3) automatic adaptation to small-object scenes—when all objects are small, Kobj=3 still captures this characteristic through small rh and rw values, applying appropriately conservative transformations. For images containing fewer than Kobj objects, we use all available objects. Preliminary experiments with Kobj∈{1,3,5,10} demonstrated that Kobj=3 provides the optimal performance-efficiency balance.

For these dominant objects, we compute their mean dimensions:

$$\bar{h}_{obj} = \frac{1}{K_{obj}} \sum_{i=1}^{K_{obj}} h_i, \qquad \bar{w}_{obj} = \frac{1}{K_{obj}} \sum_{i=1}^{K_{obj}} w_i,$$

where hi and wi represent the height and width of the i-th largest bounding box, respectively.

We then normalize these dimensions by the image size to obtain relative object size metrics:

$$r_h = \frac{\bar{h}_{obj}}{H}, \qquad r_w = \frac{\bar{w}_{obj}}{W},$$

where H and W denote the image height and width, respectively.

The base scaling factors are computed by coupling these relative object sizes directly with randomly sampled modulation factors:

$$s_h^{base} = 1 + r_h \cdot \rho_h, \qquad s_w^{base} = 1 + r_w \cdot \rho_w, \tag{3}$$

where ρh,ρw∼𝒰(ρmin,ρmax) are independently sampled from a uniform distribution over the modulation range (typically ρmin=−0.2, ρmax=0.5). The base scale of unity represents the identity transformation, with the content-dependent term r⋅ρ adding adaptive variation proportional to actual object sizes.

Aspect Ratio Jitter.

To introduce realistic shape distortions while maintaining the content-adaptive scale characteristics, we apply independent random perturbations to each dimension:

$$s_h = s_h^{base} \cdot (1 + \zeta_h), \qquad s_w = s_w^{base} \cdot (1 + \zeta_w), \tag{4}$$

where ζh,ζw∼𝒰(−ζar,ζar) represent small independent jitters (typically ζar=0.1) sampled uniformly around zero.
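The scale computation of Eqs. (3) and (4) can be sketched as follows. This is an illustrative reading of the text, not the released code: the function names, the `(x1, y1, x2, y2)` box format, and the identity fallback for images without boxes are assumptions.

```python
import random


def relative_object_size(boxes, height, width, k_obj=3):
    """Return (r_h, r_w): mean height/width of the k_obj largest boxes, normalized by image size."""
    top = sorted(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)[:k_obj]
    mean_h = sum(b[3] - b[1] for b in top) / len(top)  # uses all boxes if fewer than k_obj
    mean_w = sum(b[2] - b[0] for b in top) / len(top)
    return mean_h / height, mean_w / width


def box_aware_scales(boxes, height, width, k_obj=3,
                     rho_range=(-0.2, 0.5), zeta_ar=0.1, rng=random):
    """Sample (s_h, s_w) following Eq. (3) (content-coupled base scale) and Eq. (4) (aspect jitter)."""
    if not boxes:
        return 1.0, 1.0  # assumption: no objects means no content-driven resizing
    r_h, r_w = relative_object_size(boxes, height, width, k_obj)
    s_h_base = 1.0 + r_h * rng.uniform(*rho_range)
    s_w_base = 1.0 + r_w * rng.uniform(*rho_range)
    s_h = s_h_base * (1.0 + rng.uniform(-zeta_ar, zeta_ar))
    s_w = s_w_base * (1.0 + rng.uniform(-zeta_ar, zeta_ar))
    return s_h, s_w
```

Because the modulation term is multiplied by the relative object size, scenes with small objects stay close to the identity scale, while scenes dominated by large objects are resized more strongly.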

3.3.4 Contextual Cropping and Reflective Padding

After rotation and resizing, transformed images must be returned to their original dimensions to maintain compatibility with detector input requirements. This is achieved through adaptive cropping or padding depending on the transformation outcome.

For images enlarged beyond the original dimensions, we apply random cropping to extract a region matching the original size. For images reduced below the original dimensions, we apply reflection padding with random spatial offsets to restore the original size.
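The crop-or-pad step can be illustrated on a plain 2D list of pixel values. This sketch only mirrors the indexing logic; the paper's pipeline operates on tensors with differentiable slicing and reflection padding, and the names `reflect_index` and `fit_to_size` are hypothetical.

```python
import random


def reflect_index(i, n):
    """Map any integer index into [0, n) by mirroring about the borders without repeating the edge."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def fit_to_size(img, out_h, out_w, rng=random):
    """Return an out_h x out_w view of img (a list of rows).

    Per axis: if the image is larger, a random crop window is chosen;
    if it is smaller, it is reflect-padded with a random split between the two sides.
    """
    in_h, in_w = len(img), len(img[0])
    # A non-negative offset is a crop start; a negative offset is the padding placed before the image.
    top = rng.randint(0, in_h - out_h) if in_h >= out_h else -rng.randint(0, out_h - in_h)
    left = rng.randint(0, in_w - out_w) if in_w >= out_w else -rng.randint(0, out_w - in_w)
    return [[img[reflect_index(top + r, in_h)][reflect_index(left + c, in_w)]
             for c in range(out_w)]
            for r in range(out_h)]
```

The mirrored indexing follows the same convention as reflection padding that excludes the border pixel, so index -1 maps to 1 and index n maps to n - 2.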

3.3.5 Composite Noise Injection

As the final transformation component, we inject a complex composite noise pattern ω into the image to simulate realistic degradations:

$$\omega = \omega_G + \omega_{SPN}, \tag{5}$$

where $\omega_G \sim \mathcal{N}(0, \sigma^2)$ represents additive Gaussian noise with standard deviation $\sigma$ (typically $\sigma = 0.02$), and $\omega_{SPN}$ represents salt-and-pepper noise with probability $p_{sp} = 0.01$, where each affected pixel is randomly set to either minimum (0) or maximum (1) intensity with equal probability of 0.005 each.
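A minimal sketch of the composite noise on a flat list of intensities in [0, 1] is given here. It is not differentiable and does not reproduce the straight-through estimator mentioned later in the paper; the function name `add_composite_noise` and the final clamp to [0, 1] are assumptions.

```python
import random


def add_composite_noise(pixels, sigma=0.02, p_sp=0.01, rng=random):
    """Apply Gaussian noise, then salt-and-pepper noise, to intensities in [0, 1]."""
    noisy = []
    for value in pixels:
        value = value + rng.gauss(0.0, sigma)          # additive Gaussian component
        if rng.random() < p_sp:                        # pixel hit by salt-and-pepper
            value = 0.0 if rng.random() < 0.5 else 1.0  # pepper or salt, equally likely
        noisy.append(min(1.0, max(0.0, value)))        # assumption: keep a valid intensity range
    return noisy
```

With the default `p_sp = 0.01`, roughly 0.5% of pixels become 0 and 0.5% become 1, matching the split described in the text.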

Pipeline Integration.

The sequential application of these four complementary transformations—rotation (geometric invariance), content-aware resizing (scale invariance), contextual padding (spatial robustness), and composite noise (texture robustness)—creates a highly nonlinear and semantically rich transformation space. Critically, each transformation is implemented in a fully differentiable manner, allowing gradients of the final loss to be backpropagated through the entire pipeline to the input image. This differentiability proves essential for the EOT optimization framework described next.

3.3.6 Formalization of Dynamic Scheduling Strategy

To ensure reproducibility, we formalize the scheduling logic governing augmentation parameter evolution during optimization. Our framework employs selective dynamic scheduling: only rotation angles follow curriculum-based progression, while other parameters remain fixed to preserve semantic plausibility.

Design Rationale. We apply dynamic scheduling exclusively to rotation because: (1) geometric transformations require gradual adaptation to prevent optimization instability, (2) aggressive early rotations cause gradient vanishing, and (3) content-adaptive parameters (scale, noise) are already computed based on scene characteristics, requiring no temporal curriculum.

Algorithm 1 presents the complete scheduling mechanism. Supplementary Materials Section S2 (Fig. S1) validates this approach, demonstrating that progressive scheduling (c=0.5) achieves a 7.5% relative reduction in final AP50 (41.0% vs. 44.3%) and converges in 100 rather than 120 iterations (the fixed schedule needs 20% more iterations) compared to fixed angle schedules on PASCAL VOC 2012.

Key Scheduling Properties:

• Rotation (Dynamic): Starts at $\theta_{\max}(0) = 8^\circ$ and increases to $\theta_{\max}(160) = 12^\circ$, providing an easy-to-hard curriculum.

• Scaling (Fixed): Scale factors sh,sw are resampled each iteration but maintain constant range ρ∈[−0.2,0.5].

• Noise (Fixed): Gaussian standard deviation σ=0.02 and salt-and-pepper probability psp=0.01 remain constant.

Implementation of Differentiability. All augmentation operations maintain gradient flow: rotation and resizing employ differentiable bilinear interpolation via PyTorch's affine_grid, grid_sample, and interpolate functions; cropping uses differentiable tensor slicing; reflection padding applies torch.nn.functional.pad with mode='reflect'; Gaussian noise addition is trivially differentiable; salt-and-pepper noise uses a straight-through estimator for gradient flow. Discrete operations (top-$K_{obj}$ object selection, stochastic center choice) are resolved outside the computational graph, so only continuous transformation parameters participate in backpropagation. End-to-end differentiability is verified by confirming all intermediate tensors retain requires_grad=True and observing smooth convergence (see Section 4.7 for empirical validation).

3.4 Practical Black-Box Scenario: Using Predicted Bounding Boxes

While our core methodology leverages ground-truth bounding boxes for object-aware transformations during attack optimization, we address the practical concern of black-box scenarios where ground-truth annotations are unavailable. In realistic attack settings, adversaries have access only to the source model’s predictions rather than true annotations.

To validate the practical applicability of our approach, we evaluate a variant that uses predicted bounding boxes (PB) from the source detector instead of ground-truth (GT) annotations. Specifically:

• During attack generation, we obtain bounding box predictions from the source detector (e.g., Faster R-CNN ResNet50) on the clean image.

• These predicted boxes replace ground-truth annotations in our multi-box aware resizing (Section 3.3.3) and object-centric rotation (Section 3.3.2) components.

• All other augmentation parameters and optimization procedures remain identical.

This modification enables our method to operate in true black-box conditions without requiring dataset annotations. Section 4.2.1 presents comprehensive experimental validation demonstrating that the predicted-box variant maintains comparable—and in some cases superior—attack effectiveness compared to the ground-truth variant.

3.5 Attack Objective and Optimization

Having defined our semantically aware transformation pipeline, we now specify the attack objective and optimization procedure that leverages this pipeline to generate transferable adversarial perturbations.

3.5.1 Multi-Component Loss Function

For two-stage detectors like Faster R-CNN [10], the architecture of our source models, we construct a comprehensive loss function that targets all critical components of the detection pipeline. The detector operates through two stages: a Region Proposal Network (RPN) that generates object proposals, and a detection head (ROI Head) that classifies and refines these proposals.

To maximally disrupt the detector, our loss function aggregates four primary loss terms spanning both stages:

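$$\mathcal{L}_{sample} = \alpha_{cls} (\mathcal{L}_{cls})^{\gamma_{cls}} + \alpha_{box} \mathcal{L}_{box\_reg} + \alpha_{obj} (\mathcal{L}_{obj})^{\gamma_{obj}} + \alpha_{rpn\_box} \mathcal{L}_{rpn\_box\_reg}, \tag{6}$$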

where:

• ℒcls denotes the final classification loss from the detection head,

• ℒbox_reg denotes the final bounding box regression loss,

• ℒobj denotes the RPN objectness loss (binary classification of proposals),

• ℒrpn_box_reg denotes the RPN bounding box regression loss.

The weights αcls,αbox,αobj,αrpn_box control the relative contribution of each component. In our experiments, we use αcls=1.0, αbox=1.0, αobj=2.0, and αrpn_box=1.0, placing emphasis on disrupting the objectness predictions.
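Eq. (6) with the reported weights can be written as a small helper. This is a sketch on scalar loss values only: the function name `sample_loss` is hypothetical, and the exponents default to the value 0.8 listed among the paper's hyperparameters. In practice the four terms would come from the source detector's training-mode loss outputs.

```python
def sample_loss(l_cls, l_box_reg, l_obj, l_rpn_box_reg,
                alpha_cls=1.0, alpha_box=1.0, alpha_obj=2.0, alpha_rpn_box=1.0,
                gamma_cls=0.8, gamma_obj=0.8):
    """Aggregate the four Faster R-CNN loss terms as in Eq. (6).

    Only the classification and objectness terms are raised to an exponent;
    the two box-regression terms enter linearly.
    """
    return (alpha_cls * l_cls ** gamma_cls
            + alpha_box * l_box_reg
            + alpha_obj * l_obj ** gamma_obj
            + alpha_rpn_box * l_rpn_box_reg)
```

For example, when all four component losses equal 1, the aggregate is 1 + 1 + 2 + 1 = 5, which makes the doubled weight on the objectness term visible.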

Gradient Regularization via Nonlinear Scaling.

The most important innovation in our loss formulation is the use of scaling exponents γcls,γobj<1 (typically γcls=γobj=0.8) for the classification and objectness loss terms. This nonlinear transformation serves as a gradient regularizer: low-loss proposals (harder to fool) receive amplified gradient signals, while high-loss proposals (easier to fool) have dampened signals, preventing them from dominating optimization.

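For a loss component $\mathcal{L}$ with exponent $\gamma < 1$, the chain rule gives $\nabla_\delta \mathcal{L}^\gamma = \gamma \mathcal{L}^{\gamma - 1} \nabla_\delta \mathcal{L}$, so the gradient scaling factor $\gamma \mathcal{L}^{\gamma - 1}$ decreases as $\mathcal{L}$ grows (it is proportional to $\mathcal{L}^{\gamma - 1}$ rather than to $1/\mathcal{L}$). With $\gamma = 0.8$, hard-to-attack objects ($\mathcal{L} = 0.1$) receive approximately 2.5× stronger gradient signals compared to easy-to-attack objects ($\mathcal{L} = 10.0$), ensuring balanced optimization across all scene elements.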
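The 2.5× figure can be checked numerically. The helper name `gradient_scale` is introduced only for this illustration; it evaluates the factor that multiplies the raw gradient when a loss term is raised to the power γ.

```python
def gradient_scale(loss, gamma=0.8):
    """Factor gamma * loss**(gamma - 1) that multiplies the gradient of loss**gamma."""
    return gamma * loss ** (gamma - 1.0)


# Ratio between a hard-to-attack term (loss 0.1) and an easy-to-attack term (loss 10.0).
ratio = gradient_scale(0.1) / gradient_scale(10.0)
```

The ratio equals (10.0 / 0.1) raised to the power 0.2, i.e. 100^0.2, which is about 2.51.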

Theoretical Connection to Hard Example Mining. This nonlinear scaling mechanism is conceptually related to focal loss [27] and hard example mining [10], but serves a distinct purpose in the adversarial context. While focal loss re-weights classification errors to address class imbalance during training, our formulation addresses gradient imbalance during attack optimization. Specifically, in multi-object scenes, naive loss summation causes easy-to-fool proposals (large objects, high contrast) to dominate $\nabla_\delta \mathcal{L}$, leading to perturbations that overfit to source-model-specific artifacts associated with these objects. By applying the concave transformation $\mathcal{L}^\gamma$ with $\gamma < 1$, we enforce a variance reduction property: $\mathrm{Var}(\partial \mathcal{L}^\gamma / \partial \delta) < \mathrm{Var}(\partial \mathcal{L} / \partial \delta)$, ensuring the attack distributes optimization effort uniformly across all object difficulty levels rather than concentrating on model-specific weak points. This uniform coverage is critical for transferability: perturbations effective across diverse object characteristics (size, contrast, position) exploit fundamental shared vulnerabilities rather than architecture-specific biases, directly aligning with the non-robust features hypothesis [15]. Empirically, $\gamma = 0.8$ reduces per-category variance by 51% while improving transferability by 37% (Supplementary Materials Section S1), validating that gradient regularization constitutes a principled mechanism for learning generalizable adversarial perturbations.

This mechanism proves particularly important in complex scenes with multiple objects of different sizes. A detailed mathematical analysis including quantitative examples, gradient variance reduction, and theoretical justification is provided in Supplementary Materials Section S1.

3.5.2 Iterative EOT-Based Optimization

The complete attack is formulated as an iterative optimization procedure that combines our multi-component loss function with the semantically aware transformation pipeline through the EOT framework. Algorithm 2 presents the detailed procedure.

We employ a Projected Gradient Descent (PGD) style iterative approach [35]. In each iteration, we:

1. Sample NEOT independent transformation sequences from our pipeline.

2. Apply each transformation to the current adversarial candidate.

3. Compute the loss and gradient for each transformed sample.

4. Average the gradients across all EOT samples.

5. Update the adversarial example using the sign of the averaged gradient.

6. Project the perturbation back into the ε-ball around the original image.

This EOT-averaged gradient captures the expected gradient over the distribution of semantically aware transformations, forcing the optimization to find perturbations that remain effective across diverse content-aware variations.
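The six steps can be sketched as a generic loop over plain Python lists. This is a toy illustration under stated assumptions, not the paper's Algorithm 2: `eot_pgd`, `grad_fn`, and `sample_transform` are hypothetical names, the update ascends the loss (the attack maximizes the detection loss), pixel values are assumed to lie in [0, 1], and the "detector" is replaced by a linear toy loss whose gradient is known in closed form.

```python
import random


def eot_pgd(x, grad_fn, sample_transform, eps=5 / 255, eta=0.0004,
            n_iter=160, n_eot=10, rng=random):
    """Sign-gradient ascent with gradients averaged over n_eot sampled transformations."""
    x_adv = list(x)
    for k in range(n_iter):
        avg = [0.0] * len(x)
        for _ in range(n_eot):
            t = sample_transform(k, rng)           # step 1: sample a transformation
            g = grad_fn(x_adv, t)                  # steps 2-3: gradient of the loss through t
            avg = [a + gi / n_eot for a, gi in zip(avg, g)]  # step 4: average
        for i, g in enumerate(avg):
            step = eta * ((g > 0) - (g < 0))       # step 5: signed update
            delta = max(-eps, min(eps, x_adv[i] + step - x[i]))  # step 6: project to eps-ball
            x_adv[i] = min(1.0, max(0.0, x[i] + delta))
    return x_adv


# Toy stand-in for a detector: loss(x; s) = sum_i w_i * s * x_i, so the gradient is s * w_i.
WEIGHTS = [1.0, -2.0, 0.5]


def toy_grad(x_adv, s):
    return [s * w for w in WEIGHTS]


def toy_transform(k, rng):
    return rng.uniform(0.8, 1.2)  # random positive scale standing in for an augmentation


x_clean = [0.5, 0.5, 0.5]
x_adv = eot_pgd(x_clean, toy_grad, toy_transform, rng=random.Random(0))
```

In this toy setting every coordinate ends at the boundary of the ε-ball, moved in the direction of the sign of its weight; with a real detector, `grad_fn` would backpropagate the aggregated loss through the differentiable augmentation pipeline.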

Hyperparameter Configuration.

In our experiments, we employ the following hyperparameter settings: maximum perturbation budget ε=5/255, total iterations Kmax=160, step size η=0.0004, and EOT samples per iteration NEOT=10.

For the augmentation pipeline, we set: base rotation angle range $(\theta_{\min}, \theta_{\max}) = (-8^\circ, 8^\circ)$, curriculum scaling factor $c = 0.5$, resizing modulation range $(\rho_{\min}, \rho_{\max}) = (-0.2, 0.5)$, aspect ratio jitter range $\zeta_{ar} = 0.1$, and Gaussian noise standard deviation $\sigma = 0.02$. We apply 2 sequential transformations per EOT sample, randomly sampling from the available transformation types.

These hyperparameters were selected based on preliminary experiments balancing attack effectiveness with computational efficiency. Our convergence analysis in Section 4.5 demonstrates that 160 iterations suffice for the attack to reach stable performance on both PASCAL VOC and MS COCO datasets. Fig. 1 illustrates the complete overview of the proposed AugTrans framework, showing the full pipeline from input image through augmentation, source detector, and adversarial example generation.

Figure 1: Overview of the proposed adversarial attack framework. An original image is fed into an advanced data augmentation pipeline within an Expectation over Transformation (EOT) loop. The pipeline applies a sequence of stochastic, differentiable transformations (Dynamic Rotation, Box-Aware Resize, Noise Injection, Crop/Pad). The augmented adversarial candidate is passed through the source detector (comprising a Backbone, RPN, and ROI Head). The total loss, combining RPN and ROI Head losses, is computed. The gradient of this EOT-averaged loss is then used to update the adversarial example, which is iteratively refined to fool the detector while remaining visually similar to the original image.

4 Experiments

We conduct comprehensive evaluations to assess our method’s transferability across diverse detector architectures, datasets, and attack scenarios. Our experiments validate three key claims: (1) superior black-box transferability compared to state-of-the-art methods, (2) effective cross-paradigm transfer from CNN to transformer-based detectors, and (3) practical applicability using predicted bounding boxes without ground-truth annotations.

4.1 Experimental Setup

Datasets. We evaluate on PASCAL VOC 2012 (5823 validation images, 20 classes) [57] and MS COCO 2017 (5000 validation images, 80 classes) [58]. We report mean Average Precision (mAP): AP50 denotes mAP at IoU threshold 0.5, while AP denotes average mAP over IoU thresholds 0.5–0.95 (step 0.05). Lower values indicate more effective attacks.

Models. We evaluate three detector families: (1) Two-stage: Faster R-CNN with ResNet50 (FR-R50), ResNet50v2 (FR-R50v2), and MobileNetv3 (FR-M3) backbones; (2) One-stage: SSD-VGG16 and YOLOv5s; (3) Transformer-based: DETR and DINO. We use FR-R50 as the primary source model unless stated otherwise.

Baselines. We compare against state-of-the-art transfer attacks: DAG [7], RAP [24], CAP [26], TOG [29], and Shi et al. [30]. For recent methods (2023–2024), we compare against G-UAP, GD-UAP, Daedalus, SU, PG-UAP, SGA-UAP, FD-UAP, and DIB-UAP [43].

Hyperparameters. All experiments use: perturbation budget $\varepsilon = 5/255$, iterations $K_{\max} = 160$, step size $\eta = 0.0004$, EOT samples $N_{EOT} = 10$ per iteration. Augmentation parameters: rotation range $\theta \in [-8^\circ, 8^\circ]$, curriculum factor $c = 0.5$, resizing modulation $\rho \in [-0.2, 0.5]$, aspect ratio jitter $\zeta_{ar} = 0.1$, Gaussian noise $\sigma = 0.02$, salt-and-pepper probability $p_{sp} = 0.01$. Loss weights: $\alpha_{cls} = 1.0$, $\alpha_{box} = 1.0$, $\alpha_{obj} = 2.0$, $\alpha_{rpn\_box} = 1.0$, with gradient regularization $\gamma_{cls} = \gamma_{obj} = 0.8$.

Implementation. We use PyTorch 1.13.0 with CUDA 11.7 on an NVIDIA RTX 3090 GPU. The random seed is set to 42 for reproducibility. Pretrained weights are obtained from PyTorch Vision Model Zoo² (Faster R-CNN variants), Ultralytics³ (YOLOv5s), and official repositories (DETR, DINO).

Code Availability. The source code for implementing the AugTrans framework will be made publicly available at https://github.com/sudhirpandey243/LLM-Model.

Predicted Bounding Boxes. For black-box scenarios, we obtain predictions from the source detector on clean images using confidence threshold 0.5 and NMS IoU threshold 0.5, replacing ground-truth annotations while maintaining all other hyperparameters.
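The box selection described above can be sketched as score filtering followed by greedy non-maximum suppression. This is an assumption-laden illustration: the names `iou` and `select_predicted_boxes` are hypothetical, the suppression here is class-agnostic, and detections are taken to be `(box, score)` pairs with boxes in `(x1, y1, x2, y2)` format.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_predicted_boxes(detections, score_thr=0.5, nms_iou=0.5):
    """Keep boxes with score above score_thr, then suppress overlaps greedily by descending score."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if score <= score_thr:
            break  # remaining detections are all below the confidence threshold
        if all(iou(box, other) <= nms_iou for other in kept):
            kept.append(box)
    return kept
```

The surviving boxes then take the place of ground-truth annotations in the box-aware resizing and object-centric rotation steps.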

Note on Baseline Performance. Clean model performance varies across datasets and architectures due to inherent characteristics: MS COCO’s complexity (80 classes, dense scenes) yields lower AP than PASCAL VOC (20 classes, simpler scenes). Transformer-based detectors (DETR, DINO) achieve higher accuracy through global attention, while one-stage detectors (SSD, YOLO) prioritize inference speed. All models use official pretrained weights evaluated on respective validation sets.

Reproducibility Note. All experiments use seed = 42 with single-run evaluation. This approach aligns with established practice in adversarial attack research [9,30,43,54], where single-seed reporting is standard for computationally intensive transfer experiments. Once perturbations are generated, victim model evaluation is deterministic—the same adversarial example produces identical outputs. Our convergence analysis (Section 4.7) and consistent performance patterns across architectures and datasets validate that results represent systematic transferability rather than seed-specific artifacts. Multi-seed evaluation would require 3–5× computational cost with minimal additional insight given deterministic evaluation. Code will be released for full reproducibility.

4.2 Comparison with State-of-the-Art Methods

We investigate performance against related attacks on PASCAL VOC 2012 and MS COCO 2017 datasets. Results demonstrate consistent, significant improvements in black-box transferability.

Table 2 presents transferability results on the MS COCO 2017 dataset, featuring more categories and complex scenes. Using FR-R50 as the source, our method demonstrates effective transferability across diverse architectures, with particularly strong performance on one-stage detectors. For YOLOv5s, our method degrades AP to 2.06%, a substantial reduction compared to 8.6% achieved by Shi et al. [30].

Table 3 presents PASCAL VOC 2012 results. Our attack reduces YOLOv5s AP to 8.2%, compared to 18.0% for the nearest competitor. Consistent results across benchmarks demonstrate that our augmentation strategy effectively targets vulnerabilities common across diverse detectors.

Table 4 compares against recent methods (2023–2024), showing mAP degradation (higher is better). Our method achieves average degradation of 53.0% on VOC and 35.8% on COCO across three one-stage detectors, demonstrating superior or competitive performance against all baselines. Notably, we achieve 73.2% degradation on RetinaNet (VOC) and 42.5% on RetinaNet (COCO)—the highest values among all compared methods. These results validate that semantic-aware augmentation enables strong transferability across diverse detector architectures and datasets.

Table 5 presents transfer results on PASCAL VOC 2012 using the lightweight FR-M3 as the source. Our method achieves strong transferability, degrading YOLOv5s to 8.1% AP compared to 18.0% for Shi et al. [30]—a 55% improvement. Similar gains are observed across all targets: SSD_VGG16 (26.1% vs. 28.4%), FR-R50 (16.1% vs. 19.4%), and FR-R50v2 (30.2% vs. 41.6%). These results demonstrate that semantically aware augmentation remains effective even with mobile architectures as source models.

Table 6 provides a per-category breakdown on PASCAL VOC for the challenging black-box setting in which perturbations crafted on FR-R50 are used to attack YOLOv5s. Results highlight the comprehensive nature of the attack. For challenging categories such as bicycle, bird, and diningtable, our method reduces AP to 0.0%, meaning the detector completely fails to recognize these objects. By contrast, the attack of Shi et al. leaves AP at 37.1%, 12.1%, and 10.7%, respectively.

Category-Specific Analysis. Table 6 reveals interesting patterns in attack transferability across object categories. Categories achieving complete failure (0% AP) such as bicycle, bird, chair, diningtable, and train share distinctive characteristics: they possess relatively rigid geometric structures with consistent appearance patterns that our content-aware transformations effectively disrupt. The box-aware resizing and object-centric rotation particularly destabilize the spatial features these detectors rely upon for these structurally simple objects.

Conversely, categories maintaining residual detection performance (sheep: 66.7% AP50, cow: 27.5%, horse: 28.7%) exhibit natural robustness factors: (1) high intra-class appearance variation due to texture, pose, and lighting diversity; (2) deformable body structures that preserve discriminative features under perturbation; and (3) complex natural textures that inherently mask adversarial noise. This aligns with findings by Ilyas et al. [15] that classes with greater natural variation learn more robust features.

Despite these variations, our method achieves substantial degradation across all 20 categories (average AP reduction: 89.7%), demonstrating comprehensive attack coverage while revealing the relationship between object complexity and adversarial vulnerability.

4.2.1 Ground-Truth vs. Predicted Bounding Boxes

To address the practical applicability of our method in realistic black-box scenarios, we conduct experiments comparing attack performance using ground-truth (GT) annotations vs. predicted bounding boxes (PB) from the source detector.

Tables 7 and 8 present comprehensive comparisons on MS COCO 2017 and PASCAL VOC 2012, respectively. The predicted boxes are obtained by running the source detector (FR-R50) on clean images and using detections with confidence scores above 0.5.

Key Findings:

• Maintained Effectiveness: Using predicted boxes preserves attack transferability across all victim models. On MS COCO, YOLOv5s AP drops to 1.93% (PB) compared to 2.06% (GT)—demonstrating equivalent or superior performance.

• Transformer Transferability: Cross-paradigm transfer remains strong with predicted boxes. DETR-R50 AP reduces to 2.42% (PB) vs. 2.8% (GT), while DINO-R50 achieves 6.26% (PB) vs. 5.4% (GT).

• Consistent Trends: Similar patterns emerge on PASCAL VOC, with YOLOv5s AP at 7.48% (PB) vs. 8.2% (GT), validating robustness across datasets.

• Practical Viability: These results conclusively demonstrate that our method does not require ground-truth annotations during attack generation, addressing the primary limitation of black-box applicability.

The comparable (or improved) performance using predicted boxes can be attributed to two factors: (1) high-quality predictions from the source detector provide sufficient object-level information for semantic-aware transformations, and (2) slight noise in predicted boxes may introduce beneficial regularization, preventing overfitting to exact object boundaries.

Fig. 2 presents convergence analysis using predicted boxes on MS COCO, demonstrating stable optimization dynamics similar to the ground-truth variant.

Figure 2: Convergence analysis using predicted bounding boxes on MS COCO 2017. Attack from FR-R50 achieves AP reduction from 32.6% (clean) to 1.93% (final), representing 94% performance degradation, with AP50 reducing from 48.1% to 3.68%. Top panel: mAP degradation over 160 iterations showing rapid initial decline. Bottom panel: mAP reduction from clean baseline, demonstrating comparable effectiveness to ground-truth variant (2.06% AP) and validating practical black-box applicability without dataset annotations.

4.3 Performance against Transformer-Based Detectors

Transfer attack generalizability requires fooling fundamentally different architectures. CNN-based detectors share inductive biases through local receptive fields and hierarchical features. In contrast, transformer-based detectors such as DETR [13] and DINO [14] employ radically different approaches based on global self-attention and set prediction. These models eliminate hand-crafted components like anchors and NMS, instead learning to directly predict object sets through learned queries.

Our experiments demonstrate that AugTrans achieves transferability to transformer architectures despite using only CNN-based source models. Tables 2 and 3 present results when attacking from Faster R-CNN ResNet50:

On MS COCO 2017: Our method reduces DETR-R50 AP from 42.5% to 2.8%, compared to 9.9% achieved by Shi et al. [30], i.e., a residual AP approximately 72% lower than that of the baseline. For the advanced DINO-R50, we achieve AP of 5.4% compared to 15.1% for the next best method, a 64% relative improvement in attack efficacy. These results are striking given that DINO represents one of the most robust transformer architectures, employing denoising training and contrastive learning.

On PASCAL VOC 2012: The trend continues with DETR-R50 AP reduced to 9.5% (vs. 20.4% for Shi et al.) and DINO-R50 AP reduced to 13.2% (vs. 29.1%). Consistent results across benchmarks confirm that the performance gain reflects a fundamental capability to target shared vulnerabilities.
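The percentages quoted for MS COCO follow from a simple relative-reduction formula applied to the residual AP values. The helper name `relative_reduction` is introduced here only to make that arithmetic explicit.

```python
def relative_reduction(baseline_ap, our_ap):
    """Percentage by which our residual AP is lower than the baseline's residual AP."""
    return 100.0 * (baseline_ap - our_ap) / baseline_ap


detr_coco = relative_reduction(9.9, 2.8)    # DETR-R50: Shi et al. vs. ours
dino_coco = relative_reduction(15.1, 5.4)   # DINO-R50: next best method vs. ours
```

Rounded to whole numbers these give 72 and 64, matching the figures in the text; the same function can be applied to the PASCAL VOC values.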

This cross-paradigm transferability proves noteworthy because transformers process information through entirely different mechanisms. CNNs rely on local convolutions and spatial hierarchies, whereas transformers employ global self-attention to model long-range dependencies. The successful transfer of a perturbation optimized against CNNs to attention-based models supports our hypothesis: regularization through semantically aware transformations forces targeting of deeper non-robust features that transcend architectural choices.

Theoretical Justification for Cross-Paradigm Transfer. The successful transfer from CNN-based source models to transformer-based detectors requires theoretical grounding. Recent work by Bai et al. [59] demonstrates that despite architectural differences, CNNs and Vision Transformers learn remarkably similar intermediate representations when trained on the same tasks. Their analysis reveals that both architectures converge to comparable feature manifolds in high-dimensional space, with centered kernel alignment (CKA) scores exceeding 0.7 between corresponding layers. This representational similarity provides the foundation for our observed transferability.

Furthermore, Mao et al. [60] establish that adversarial perturbations crafted on CNNs remain effective against transformers precisely because both architectures rely on similar low-level texture statistics and mid-level shape features for object recognition. Their gradient attribution analysis shows that adversarial examples exploit shared non-robust features (in the sense of Ilyas et al. [15]) that exist independent of the attention mechanism. Specifically, they demonstrate that perturbations targeting frequency-domain vulnerabilities transfer with 73% efficacy from ResNet to ViT architectures.

Our results align with these findings: the 72% AP reduction on DETR-R50 and 64% on DINO-R50 (compared to baselines) can be attributed to our semantic-aware augmentation forcing perturbations to target these architecture-invariant features. By optimizing over content-adaptive transformations that both CNNs and transformers must handle (scale, rotation, noise), we implicitly select for vulnerabilities in shared representational spaces rather than architecture-specific inductive biases.

Comparative Analysis and Scope: Table 9 provides important context for our claims. While specialized transformer-aware methods such as BFDA [54] achieve superior performance on transformer targets through architecture-specific mechanisms (DETR: 0.012 vs. our 0.058; DINO: 0.018 vs. our 0.068), our contribution lies in achieving robust cross-paradigm transfer using only CNN-source optimization without requiring target architecture knowledge—a critical advantage in true black-box scenarios where the target architecture remains unknown. The complementary strengths are evident: our method demonstrates superior effectiveness against CNN-based detectors (MRCNN: 0.110 vs. BFDA’s 0.162; YOLOX: 0.214 vs. BFDA’s 0.222), while BFDA excels against transformers through transformer-specific vulnerability exploitation. Our contribution demonstrates that semantically aware augmentation enables strong cross-paradigm transfer from a single source model without architecture-specific tuning. This offers practical benefits in terms of computational efficiency and generalizability compared to ensemble-based or architecture-specific approaches.


4.4 Cross-Dataset Transferability

To evaluate generalization beyond dataset-specific patterns, we conduct cross-dataset transfer experiments where attacks are generated on PASCAL VOC but tested on MS COCO-trained detectors (VOC→COCO). This scenario poses a significant challenge due to substantial distributional differences: PASCAL VOC contains 20 object categories with relatively simple scenes, while MS COCO encompasses 80 categories with complex multi-object compositions averaging 7.7 objects per image.

Fig. 3 presents the results. Despite these fundamental differences in object categories, image distributions, and scene complexity, our method achieves a transfer rate of 91.0%±7.8%. Specifically, attacks generated on VOC (reducing source detector mAP@50 from ∼90% to ∼8%) successfully transfer to COCO-trained detectors, degrading target YOLOv5s performance from 75% mAP@50 (clean) to near 0% (attacked). The mAP@50:95 metric shows similar trends, with convergence occurring around 80 iterations on both source and target domains.


Figure 3: Cross-dataset transfer analysis (VOC→COCO). Top panels: Attack generated on PASCAL VOC source domain (left, mAP@50 degrades from 90% to 8%) transfers to MS COCO target domain (right, mAP@50 degrades from 75% to 0%), achieving 91.0% ± 7.8% transfer rate despite 60-category difference and 4× scene complexity increase. Bottom panels: Direct comparison shows synchronized performance degradation across both datasets (left), with mAP@50:95 reduction (right) validating that semantically aware augmentation targets dataset-agnostic vulnerabilities. Results demonstrate our method’s ability to generalize beyond training distribution characteristics.

This cross-dataset transferability validates our central hypothesis: semantically aware augmentation forces attacks to target fundamental, dataset-agnostic vulnerabilities rather than dataset-specific artifacts. By optimizing perturbations under content-adaptive transformations (box-aware resizing, object-centric rotation) that remain semantically plausible across both VOC’s simple scenes and COCO’s complex compositions, our method learns to exploit shared non-robust features [15] that generalize across data distributions.

The 91% transfer rate demonstrates that our object-aware augmentation strategy successfully decouples attack effectiveness from training data characteristics—a critical requirement for real-world black-box scenarios where the victim model’s training distribution remains unknown.
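To make the idea of an object-centric transformation concrete, the following is a minimal geometric sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) and that the rotation pivot is the box center. The function names are hypothetical and this is not the implementation used in the experiments; it only illustrates how a rotation can be anchored to annotated content rather than to the image center.

```python
import math


def box_center(box):
    """Center (cx, cy) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def rotate_about(point, center, angle_deg):
    """Rotate a 2-D point about `center` by `angle_deg` degrees.

    The rotation is counter-clockwise in an x-right, y-up frame; in image
    coordinates (y pointing down) the same formula appears clockwise.
    """
    theta = math.radians(angle_deg)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + dx * math.cos(theta) - dy * math.sin(theta),
        center[1] + dx * math.sin(theta) + dy * math.cos(theta),
    )


def rotate_box_corners(box, angle_deg):
    """Rotate the four corners of a box about the box's own center."""
    x_min, y_min, x_max, y_max = box
    center = box_center(box)
    corners = [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]
    return [rotate_about(c, center, angle_deg) for c in corners]


if __name__ == "__main__":
    # Hypothetical box; the pivot is the object, not the image center.
    for corner in rotate_box_corners((10.0, 0.0, 30.0, 20.0), 90.0):
        print(tuple(round(v, 3) for v in corner))
```

Because the pivot follows the annotation, the same angle produces a different pixel-level transformation for each image, which is the sense in which such a transformation is content-adaptive.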

4.5 Quantitative Performance Metrics Analysis

Beyond mAP reduction, we examine how our attacks affect detector behavior through multiple complementary metrics, including precision, recall, F1-score, false detection rate (FDR), and misclassification rate (MCR), across all victim models. Results demonstrate consistent attack success: FDR increases from a baseline of 34.3% to 94.8% (Faster R-CNN), 85.6% (Faster R-CNN V2), and 61.9% (YOLOv5), while MCR rises from 1.8% to 17.6%–22.6%. Precision drops by 92%–98% and recall by 70%–82% across architectures, with detection count distributions shifting from balanced true/false positives to extreme false positive dominance (8–12× amplification). A comprehensive quantitative analysis, including per-detector breakdowns, detection count distributions, and cross-architecture robustness comparisons, is provided in Supplementary Materials Section S6.
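For reference, the count-based metrics can be sketched as follows. This is a minimal sketch using the standard textbook definitions; in particular, FDR is assumed here to be FP / (TP + FP), which may differ from the exact definition in Supplementary Section S6, and MCR is omitted because its definition is not given in this section. The counts in the example are hypothetical.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, F1 and false detection rate from detection counts.

    tp: detections matched to a ground-truth object
    fp: detections with no matching ground-truth object
    fn: ground-truth objects with no matching detection
    """
    detections = tp + fp
    objects = tp + fn
    precision = tp / detections if detections else 0.0
    recall = tp / objects if objects else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Assumed definition: share of all detections that are false.
    fdr = fp / detections if detections else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fdr": fdr}


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(detection_metrics(tp=8, fp=2, fn=2))
    print(detection_metrics(tp=1, fp=19, fn=9))
```

Under this definition FDR equals one minus precision, so a flood of false positives raises FDR and lowers precision at the same time.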

4.6 Qualitative Analysis and Feature Visualization

Fig. 4 illustrates our attack’s cross-architecture impact. Row 1 shows correct chair detections across diverse detectors (Faster R-CNN, SSD-VGG16, MobileNetV3, Mask R-CNN) on benign images. Row 2 demonstrates that a single adversarial perturbation causes all detectors to fail, producing either cascades of false positives or missed detections. Rows 3 and 4 reveal the underlying mechanism through Grad-CAM visualizations: the attack corrupts clean, focused feature activations (Row 3) into noisy, disrupted patterns (Row 4), validating that our method targets shared vulnerabilities across architectures.


Figure 4: Illustration of the transferable adversarial attack’s cross-architecture impact. Row 1: Various standard object detectors (Faster R-CNN, SSD-VGG16, MobileNetV3, Mask R-CNN) correctly identify chairs in the benign image with high confidence (conf > 0.90). Row 2: After applying a single imperceptible perturbation (generated on FR-R50), the same detectors exhibit complete failure—either missed detections (SSD, MobileNetV3: 0 detections) or cascades of false positives (Faster R-CNN: 47 boxes; Mask R-CNN: 52 boxes with incorrect labels). Rows 3 & 4: Grad-CAM visualizations of internal feature maps reveal the attack mechanism: clean, structured features with focused activations (Row 3, average activation entropy: 2.34) are corrupted into noisy, dispersed patterns (Row 4, entropy: 5.32, +127% increase), causing all models to fail. This highlights the cross-model vulnerability exploitation that our semantically aware augmentation targets more effectively than content-agnostic methods.

Fig. 5 provides qualitative visualizations to elucidate the attack transferability mechanisms. The top two rows display detection outputs on sample images before and after the attack. On benign images, all detectors correctly identify the potted plants with single accurate boxes. After our attack, detector outputs become chaotic. Two-stage detectors such as Faster R-CNN generate numerous false positives, while one-stage models such as SSD-VGG16 and RetinaNet either fail to detect objects or generate highly inaccurate boxes. This visual evidence supports the severe performance degradation demonstrated quantitatively.


Figure 5: Qualitative results and feature map visualizations of our attack transferability. Attack generated using FR-R50 as source model. (Top two rows): Benign predictions (first row) show correct, single detections of potted plant across all architectures (confidence > 0.90). Adversarial predictions (second row) demonstrate high transferability—two-stage detectors (Faster R-CNN, Mask R-CNN) generate 35–48 false boxes, while one-stage detectors (SSD-VGG16, YOLOv5s) either fail completely (0 detections) or produce severely mislocalized boxes. (Bottom two rows): Grad-CAM feature map activations for benign images (third row) show focused attention on the object with concentrated activation patterns. Adversarial feature maps (fourth row) demonstrate fundamental disruption—dispersed model attention with +127% average entropy increase across architectures—causing detection failure. This disruption pattern remains consistent across fundamentally different architectures (CNNs and transformers), providing visual evidence that our attack successfully targets shared, architecture-invariant non-robust features rather than model-specific artifacts.

The bottom two rows visualize the internal feature maps generated by models using Grad-CAM [61]. This provides graphical support for our main hypothesis. The causal chain is clear: attacks interfere with feature maps, causing detection failure. In benign cases, all models exhibit concentrated activations over salient objects, which can be interpreted as similar learned representations. Adversarial perturbations consistently disrupt the common activation patterns across all architectures tested, causing activation maps to become diffuse and scattered. This provides strong visual support for our hypothesis: attacks successfully target general transferable non-robust features rather than model-specific artifacts. This constitutes the primary mechanism for superior black-box transferability.
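One way to quantify how "focused" or "diffuse" an activation map is uses the Shannon entropy of the normalized heatmap. The sketch below assumes a non-negative 2-D heatmap, normalization to a probability distribution, and base-2 logarithms; the exact entropy definition behind the values in Figs. 4 and 5 is not stated in this section, so the function name and these choices are illustrative assumptions.

```python
import math


def activation_entropy(heatmap):
    """Shannon entropy (bits) of a 2-D heatmap treated as a distribution.

    Negative values are clipped to zero and the map is normalized to sum to 1.
    A map concentrated on one cell gives 0; a uniform map over N cells gives
    log2(N).
    """
    values = [max(float(v), 0.0) for row in heatmap for v in row]
    total = sum(values)
    if total == 0.0:
        return 0.0
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0.0)


if __name__ == "__main__":
    focused = [[0.0, 0.0], [0.0, 1.0]]   # all mass on one location
    diffuse = [[1.0, 1.0], [1.0, 1.0]]   # mass spread evenly
    print(activation_entropy(focused), activation_entropy(diffuse))
```

Higher entropy corresponds to the dispersed activation patterns described above, while lower entropy corresponds to attention concentrated on the object.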

Fig. 6 demonstrates cross-architecture robustness, presenting detection outputs across six diverse architectures on benign vs. adversarial images. Original images show consistent, correct single-object detection across all models. DETR, FCOS, YOLOv3, DINO, YOLOX, and VFNet each accurately localize and classify cats. When our adversarial perturbation is applied, all detectors experience catastrophic failure, generating numerous false positives with spurious class predictions and scattered boxes. This synchronized breakdown across fundamentally different paradigms corroborates our quantitative findings and validates that our augmentation-driven approach generates perturbations exploiting universally shared vulnerabilities.


Figure 6: Qualitative cross-architecture comparison demonstrating attack transferability across different detector paradigms. Top row: Original benign images yield consistent, accurate single-object detection (cat) across six distinct detector architectures with high confidence (conf > 0.95): transformer-based (DETR, DINO), anchor-free (FCOS), anchor-based two-stage (Faster R-CNN, Mask R-CNN, EfficientDet), and one-stage detectors (YOLOv3, YOLOX, VFNet). Bottom row: Following addition of our adversarial perturbation using FR-R50 as source model, all detectors exhibit synchronized catastrophic failure characterized by 28–45 false positive detections with incorrect class labels (“donut”, confidence 0.65–0.85) and scattered bounding boxes. This consistent failure pattern across fundamentally different paradigms—from traditional CNN-based models with local receptive fields to modern transformer architectures with global self-attention—presents strong visual evidence of our method’s success in targeting fundamental, architecture-independent vulnerabilities through semantically aware augmentation, rather than exploiting architecture-specific artifacts.

4.7 Convergence and Efficiency Analysis

Computationally prohibitive attacks have limited practical utility [62]. We analyze attack convergence to assess efficiency. Figs. 7 and 8 plot black-box transfer performance (mAP against YOLOv5s) vs. optimization iterations on PASCAL VOC and MS COCO.


Figure 7: Convergence analysis on PASCAL VOC 2012. Attack generated using FR-R50 as source reduces black-box YOLOv5s performance from 50.2% (clean) to 8.2% AP (final), with AP50 dropping from 72.9% to 18.0%, achieving 78% of total degradation within the first 40 iterations. Both AP50 (blue) and AP (cyan) drop sharply initially and converge around 80 iterations, demonstrating efficient optimization dynamics and validating Kmax = 160 as sufficient for powerful attacks.


Figure 8: Convergence analysis on the more complex MS COCO 2017 validation set, evaluated under the same black-box transfer setting as Fig. 7. Attack from FR-R50 reduces YOLOv5s AP from 32.6% (clean) to 2.06% (final) and AP50 from 48.1% to 4.04%, with 82% of total performance reduction achieved within the first 40 iterations. Similar to the VOC results, the attack achieves the majority of its performance degradation early in the optimization process, validating the choice of Kmax = 160 as a sufficient number of steps for a powerful and efficient attack.

In both cases, the attacks converge rapidly: both AP50 and AP drop precipitously within the first 40 iterations on both datasets. Performance continues decreasing until it plateaus around 80–100 iterations. Our method therefore does not require excessive iterations to be effective, and the hyperparameter Kmax=160 is sufficient for powerful attacks without additional overhead. This efficiency is important for practical applications and large-scale experiments.
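The "fraction of total degradation" quoted in the captions of Figs. 7 and 8 can be read as the share of the clean-to-final AP drop that has been reached at a given iteration. The sketch below uses that reading as an assumption; the helper names are hypothetical, and the AP values it derives at iteration 40 are implied by the reported fractions rather than reported directly.

```python
def degradation_fraction(clean_ap, current_ap, final_ap):
    """Share of the total clean-to-final AP drop achieved at `current_ap`."""
    total_drop = clean_ap - final_ap
    if total_drop <= 0:
        raise ValueError("final_ap must be lower than clean_ap")
    return (clean_ap - current_ap) / total_drop


def ap_at_fraction(clean_ap, final_ap, fraction):
    """AP implied by having achieved `fraction` of the total drop."""
    return clean_ap - fraction * (clean_ap - final_ap)


if __name__ == "__main__":
    # Clean/final AP and fractions as quoted in the captions of Figs. 7 and 8.
    voc_ap_40 = ap_at_fraction(50.2, 8.2, 0.78)    # about 17.4 (implied)
    coco_ap_40 = ap_at_fraction(32.6, 2.06, 0.82)  # about 7.6 (implied)
    print(round(voc_ap_40, 2), round(coco_ap_40, 2))
```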

4.8 Computational Cost Analysis

Table 10 summarizes the computational cost comparison. We evaluate computational efficiency on MS COCO 2017 using an NVIDIA RTX 3090 GPU. Our method requires 16.7 s per image (2.9× slower than the baseline’s 5.6 s) but achieves superior effectiveness-per-computation: 10× fewer FLOPs (2292 vs. 25,208 GFLOPs), 4.2× better attack success (2.06% vs. 8.6% AP), and 10.6× higher efficiency score. The time overhead stems from semantic augmentation (27.9% of computation) but enables GPU-optimized operations with better computational efficiency. Acceleration strategies reduce generation time to ∼5 s per image with minimal performance degradation (comprehensive analysis in Section S4, Supplementary Materials).
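The headline factors above follow from simple ratios of the quoted figures. The short check below recomputes them; the variable names are illustrative, and the "efficiency score" is not recomputed because its definition is given only in the supplementary material.

```python
# Figures as quoted in the paragraph above.
ours_time_s, baseline_time_s = 16.7, 5.6
ours_gflops, baseline_gflops = 2292.0, 25208.0
ours_ap, baseline_ap = 2.06, 8.6

time_ratio = ours_time_s / baseline_time_s    # about 2.98, quoted as 2.9x slower
flops_ratio = baseline_gflops / ours_gflops   # about 11.0, quoted as 10x fewer
ap_ratio = baseline_ap / ours_ap              # about 4.17, quoted as 4.2x better

if __name__ == "__main__":
    print(f"time:  {time_ratio:.2f}x")
    print(f"FLOPs: {flops_ratio:.2f}x")
    print(f"AP:    {ap_ratio:.2f}x")
```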


4.9 Ablation Studies

We conduct rigorous ablation studies over the augmentation pipeline, loss function components, curriculum scheduling analysis, and hyperparameter sensitivity. These provide systematic evidence that each component contributes meaningfully to the final performance.

4.9.1 Augmentation Pipeline Analysis

Our central claim is that stronger object-aware augmentation yields better transferability. To test it, we add the augmentation modules to the baseline EOT attack one at a time and measure transfer performance after each addition. Fig. 9 plots the resulting mAP against black-box models.


Figure 9: Ablation study on the augmentation pipeline components. Each component addition monotonically improves transferability on MS COCO: Baseline EOT (YOLOv5s AP: 13.2%) → +Dynamic Rotation (10.8%, 18% improvement) → +Box-Aware Resize (7.5%, 31% further improvement) → +Noise Injection (6.1%, 19% improvement) → Full Pipeline (2.06%, 66% improvement), achieving 84% total AP reduction. The mAP (IoU = 0.50:0.95, lower is better) performance is plotted for various attack configurations against multiple victim models. The monotonic downward trend across all architectures (Faster R-CNN variants, SSD, YOLOv5s) validates the cumulative synergistic effects of combining diverse semantically aware augmentations, providing strong validation for our central thesis.

The plot shows a monotonic downward mAP trend (indicating more successful attacks) as components are added. The baseline EOT attack has limited transferability. Adding Dynamic Rotation provides a significant boost, and the subsequent additions of Box-Aware Resize and Noise Injection improve performance further. The full pipeline, which combines all modules, achieves the best performance across all victim models. This illustrates the cumulative, synergistic effect of combining diverse context-aware, semantically relevant augmentations and supports our central thesis.
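The step-wise percentages in the caption of Fig. 9 are relative AP reductions with respect to the previous configuration. A minimal sketch of that arithmetic, using the YOLOv5s AP values quoted in the caption (the function name is illustrative):

```python
def stepwise_relative_reduction(ap_values):
    """Relative AP reduction (%) of each configuration versus the previous one."""
    return [
        (prev - cur) / prev * 100.0
        for prev, cur in zip(ap_values, ap_values[1:])
    ]


# YOLOv5s AP (%) per configuration, as quoted in the caption of Fig. 9:
# baseline EOT, +rotation, +box-aware resize, +noise injection, full pipeline.
aps = [13.2, 10.8, 7.5, 6.1, 2.06]

steps = stepwise_relative_reduction(aps)
total = (aps[0] - aps[-1]) / aps[0] * 100.0

if __name__ == "__main__":
    print([round(s) for s in steps])  # [18, 31, 19, 66]
    print(round(total))               # 84
```

Each step is measured against the configuration immediately before it, which is why the individual percentages do not add up to the 84% total.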

4.9.2 Individual Component Contribution Analysis

To isolate individual contributions, we evaluate each component independently by adding only one module to the baseline EOT. As shown in Fig. 10, Multi-Box Aware Resize provides the strongest single-component contribution, reducing AP by 3.1 percentage points—outperforming Dynamic Rotation (2.2 points) and Noise Injection (0.9 points).


Figure 10: Individual component contribution analysis on MS COCO 2017 (FR-R50→YOLOv5s transfer). Box-Aware Resize provides the strongest individual contribution with AP reduction from baseline 32.6% to 29.5% (Δ = −3.1pp), outperforming Dynamic Rotation (Δ = −2.2pp) and Noise Injection (Δ = −0.9pp). Notably, full pipeline synergy (Δ = −5.34pp total reduction to 2.06% AP) exceeds the sum of individual contributions (Σ = −6.2pp), indicating positive component interaction rather than simple additivity. Each configuration adds only one augmentation component to baseline EOT. Results confirm content-adaptive scaling (Box-Aware Resize) as the core innovation with lowest AP among all single-component configurations, while complementary transformations amplify its effectiveness through synergistic integration.

Notably, the full pipeline’s improvement (5.34 points) exceeds the sum of individual contributions (6.2 points total), indicating positive synergy among components. This validates that content-adaptive scaling constitutes the core innovation, with the lowest AP among all single-component configurations demonstrating its primary role in transferability, while complementary transformations amplify its effectiveness.

4.9.3 Loss Function Analysis

We analyze the contribution of each term in our composite loss function. Fig. 11 shows black-box AP50 when each loss term is individually removed from the full attack. Removing any term decreases attack effectiveness (AP50 increases). The largest decrease occurs when the RPN objectness loss (ℒobj) is removed, which indicates that blinding the detector to candidate objects at the first pipeline stage matters most. Removing the classification or regression loss also hurts performance. These results indicate that holistic attacks targeting all detector pipeline components are important for maximal transferability across detectors.


Figure 11: Ablation study of the loss function components on PASCAL VOC 2012 (AP50, lower is better). Removing any component degrades attack effectiveness: Full attack (18.0% AP50) → No RPN objectness loss (+113% degradation to 38.4%, largest impact) → No classification loss (+35% to 24.3%) → No bounding box regression (+22% to 22.0%). The plot displays AP50 for the full attack (FR-R50 source, YOLOv5s victim) compared to variants where specific loss terms are removed. The most significant performance decrease occurs when eliminating RPN objectness loss (ℒobj), validating our emphasis on disrupting objectness predictions (αobj = 2.0). Results demonstrate that holistic attacks targeting all detector pipeline components are essential for maximal black-box transferability across diverse architectures.
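The ablation protocol can be sketched as a weighted sum of loss terms with individual terms switched off. This is a minimal sketch, not the form of Eq. (1): only αobj = 2.0 is taken from the text, the unit weights for the classification and regression terms are assumptions, and the function and argument names are hypothetical. The second helper recomputes the relative AP50 increases quoted in the caption of Fig. 11.

```python
def composite_attack_loss(l_obj, l_cls, l_reg,
                          alpha_obj=2.0, alpha_cls=1.0, alpha_reg=1.0,
                          disabled=()):
    """Weighted sum of detection loss terms, skipping those in `disabled`.

    alpha_obj = 2.0 follows the text; alpha_cls and alpha_reg are assumed.
    `disabled` may contain any of "obj", "cls", "reg".
    """
    terms = {
        "obj": alpha_obj * l_obj,
        "cls": alpha_cls * l_cls,
        "reg": alpha_reg * l_reg,
    }
    unknown = set(disabled) - set(terms)
    if unknown:
        raise ValueError(f"unknown loss terms: {sorted(unknown)}")
    return sum(value for name, value in terms.items() if name not in disabled)


def relative_increase(full_ap50, ablated_ap50):
    """Relative AP50 increase (%) of an ablated attack over the full attack."""
    return (ablated_ap50 - full_ap50) / full_ap50 * 100.0


if __name__ == "__main__":
    # Hypothetical per-term loss values.
    print(composite_attack_loss(0.5, 1.0, 0.25))
    print(composite_attack_loss(0.5, 1.0, 0.25, disabled=("obj",)))
    # AP50 values quoted in the caption of Fig. 11.
    for ablated in (38.4, 24.3, 22.0):
        print(round(relative_increase(18.0, ablated)))
```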

4.9.4 Curriculum Scheduling Analysis

We validate the effectiveness of progressive angle scheduling. The curriculum-based approach achieves an AP50 of 41.0% compared to 44.3% for fixed angles, a 7.5% relative improvement, with 20% faster convergence. These results indicate that easy-to-hard scheduling prevents early failure while ensuring robustness to geometric transformations. Detailed analysis and convergence comparisons are provided in Supplementary Materials Section S2.
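An easy-to-hard angle schedule can be illustrated with a simple linear ramp. The sketch below is an assumption for illustration only: the actual schedule is described in the method section and Supplementary Section S2, and the ramp shape, the `ramp_fraction` parameter, and the 30-degree final range used in the example are all hypothetical.

```python
def max_rotation_angle(step, total_steps, final_angle_deg, ramp_fraction=0.5):
    """Upper bound on the sampled rotation angle at optimization step `step`.

    The bound grows linearly from 0 to `final_angle_deg` over the first
    `ramp_fraction` of the steps, then stays constant (easy-to-hard).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    progress = min(1.0, max(0.0, step / ramp_steps))
    return final_angle_deg * progress


if __name__ == "__main__":
    # Kmax = 160 as in the text; the 30-degree range is a hypothetical value.
    for k in (0, 40, 80, 160):
        print(k, max_rotation_angle(k, total_steps=160, final_angle_deg=30.0))
```

Early iterations see only small rotations, so the perturbation is first fitted under mild transformations before the full range is applied.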

4.9.5 Hyperparameter Sensitivity

We analyze sensitivity to key hyperparameters (EOT samples, rotation range, gradient regularization, noise injection). Results demonstrate that selected hyperparameters achieve near-optimal performance across all dimensions. For example, Kobj=3 achieves 89% of maximum effectiveness at 30% computational cost, and γ=0.8 achieves 34.6% mAP drop with 37% improvement over linear scaling. Comprehensive sensitivity analysis across all parameters is provided in Supplementary Materials Section S3.

4.9.6 Summary of Findings

Our ablation studies yield three principal insights validating our theoretical framework. First, Box-Aware Resize provides the strongest individual contribution (Fig. 10), empirically confirming our theoretical prediction (Section 2.2) that content-adaptive transformations prevent shortcut learning by constraining optimization to semantically plausible variations. Second, the RPN objectness loss proves critical (Fig. 11), with nonlinear gradient regularization (γ=0.8) reducing per-category variance by 51% while improving transferability by 37%, demonstrating that balanced attack coverage enhances generalization. Third, full pipeline synergy exceeds individual component contributions by ∼23%, as geometric transformations (rotation, resize) and texture transformations (noise) jointly cover spatial and appearance invariances.

Critically, these findings align with the non-robust features hypothesis [15]: perturbations optimized under semantically plausible transformation distributions necessarily target features that are (1) predictive across architectures, and (2) vulnerable to realistic variations. The improvement over content-agnostic T-SEA (Table 2) provides quantitative validation of this design principle, establishing both empirical evidence and theoretical foundation for semantic-aware augmentation as an effective strategy for generating transferable attacks.

4.10 Robustness against Defense Mechanisms

To evaluate practical robustness, we assess effectiveness against common input transformation defenses (JPEG compression, bit-depth reduction, Gaussian smoothing, median filtering, random resizing).

Key Findings: Our attack maintains effectiveness across all defenses with only marginal degradation. The best-performing defense (JPEG-50) reduces average detections from 89.31 to 84.91 per image on MS COCO, representing less than 5% improvement. This limited effectiveness has three causes.

Why Defenses Fail: (1) Transformation-aware optimization during attack generation makes the perturbation robust to similar defensive transformations, (2) our attack targets fundamental vulnerabilities in learned representations rather than superficial artifacts, and (3) content-adaptive perturbations resist content-agnostic preprocessing.

A comprehensive defense evaluation, including detailed results, visualizations, and analysis of JPEG compression, bit-depth reduction, spatial filtering, and comparisons with clean images, is provided in Supplementary Materials Section S5.
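As an illustration of what a content-agnostic input transformation defense does, the sketch below implements bit-depth reduction on 8-bit pixel values. It is a generic, minimal version of the technique written for this illustration (the function name and the uniform quantization scheme are assumptions), not the exact preprocessing used in the evaluation.

```python
def reduce_bit_depth(pixels, bits):
    """Quantize 8-bit pixel values (0-255) to `bits` bits per channel.

    Each value is snapped to the nearest of 2**bits evenly spaced levels and
    mapped back to the 0-255 range, which removes small intensity differences.
    """
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    levels = (1 << bits) - 1
    return [round(round(p / 255 * levels) / levels * 255) for p in pixels]


if __name__ == "__main__":
    # Two nearby intensities collapse to the same value at 4 bits.
    print(reduce_bit_depth([100, 102], bits=4))
    print(reduce_bit_depth([0, 127, 128, 255], bits=1))
```

The same quantization is applied to every pixel regardless of image content, which is what is meant above by content-agnostic preprocessing.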

5 Conclusion

This work presents AugTrans, an adversarial attack framework enhancing transferability in object detection through semantically aware input-space regularization. Our key innovation is a multi-stage augmentation pipeline incorporating object-level information into transformation design: dynamic object-centric rotation with adaptive scheduling, multi-box aware resizing based on scene characteristics, and composite noise injection—all integrated within the EOT framework. By optimizing perturbations across semantically meaningful transformations, our method targets shared architectural vulnerabilities rather than model-specific artifacts.

Key Results: Extensive experiments demonstrate strong black-box performance, reducing YOLOv5s AP to 2.06% on MS COCO—substantially outperforming prior general-purpose transfer methods. Critically, our method maintains effectiveness using predicted bounding boxes (1.93% AP), eliminating ground-truth dependency and validating practical applicability in realistic black-box scenarios.

Cross-Paradigm Transfer: Our approach achieves competitive transferability to transformer-based detectors (DETR-R50: 2.8% AP, DINO-R50: 5.4% AP) without requiring target architecture knowledge. While specialized transformer-aware methods achieve superior transformer-specific performance when architecture information is available, our contribution lies in robust cross-paradigm transfer from a single CNN source—offering critical advantages in true black-box settings and computational efficiency compared to ensemble or architecture-specific approaches.

Limitations and Scope: Our framework is explicitly designed for object detection, not domain-agnostic vision tasks. The pipeline requires bounding box annotations for Box-Aware Resize (Section 3.3.3) and Object-Centric Rotation (Section 3.3.2), while the loss function (Eq. (1)) targets detection-specific components (RPN objectness, ROI head classification/regression). These dependencies prevent direct application to image classification or unstructured tasks. However, the core principle—semantic-aware augmentation leveraging task-specific structure—could inspire adaptations to instance segmentation or pose estimation.

Implications: The success of this augmentation-centric strategy suggests corresponding defenses: adversarially training detectors against semantically aware transformation distributions could yield models robust to both digital attacks and real-world corruptions. We acknowledge the dual-use nature of this research and present findings to foster AI safety and security progress. Complementary to adversarial training, detection-oriented countermeasures for adversarial examples, such as identifying compromised sensors in multimodal settings, may inspire future defense directions for object detectors.

Future Work: Priority directions include: (1) developing robust defenses through adversarial training, (2) extending to semantic/instance segmentation, (3) investigating adaptive augmentation for specific architectures, and (4) exploring architecture-agnostic features for stronger CNN-to-transformer transfer.

Acknowledgement: All authors extend gratitude to Princess Nourah bint Abdulrahman University for supporting this study.

Funding Statement: This work was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R104), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Author Contributions: Sudhir Kumar Pandey designed the core methodology, implemented the AugTrans framework and augmentation modules, conducted experiments, and wrote the initial manuscript. Jian-Xun Mi supervised the project, provided conceptual guidance, and revised the final manuscript. Zahid Ullah and Mona Jamjoom were responsible for funding acquisition and providing resources. They assisted in validation and contributed to final review and editing. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The authors confirm that data supporting the findings are available within the article.

Ethics Approval: This study does not involve human participants, animal subjects, or any activities requiring ethics approval. All experiments were conducted using publicly available datasets (PASCAL VOC 2012 and MS COCO 2017).

Conflicts of Interest: The authors declare no conflicts of interest regarding the present study.

Supplementary Materials: The supplementary material is available online at https://www.techscience.com/doi/10.32604/cmc.2026.074811/s1.

1 While τ is not explicitly computed in practice (as p(⋅|B) lacks closed form), our design implicitly enforces high plausibility: by computing transformation parameters as functions of ground-truth object statistics (Section 3.3.3), all generated variations satisfy τ>0.9 based on empirical validation against COCO image statistics.

2 https://pytorch.org/vision/stable/models.html

3 https://github.com/ultralytics/yolov5

References

1. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27–30; Las Vegas, NV, USA. p. 770–8. [Google Scholar]

2. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); 2015 Oct 5–9; Munich, Germany. p. 234–41. [Google Scholar]

3. Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv:1804.02767. 2018. [Google Scholar]

4. Ahmad I, Shang F, Pathan MS, Wajahat A, Kim YS. Dual-stream hybrid architecture with adaptive multi-scale boundary-aware mechanisms for robust urban change detection in smart cities. Sci Rep. 2025;15(1):30729. doi:10.1038/s41598-025-16148-5. [Google Scholar] [PubMed] [CrossRef]

5. Kurniawan A, Ohsita Y, Murata M. Experiments on adversarial examples for deep learning model using multimodal sensors. Sensors. 2022;22:8642. doi:10.3390/s22228642. [Google Scholar] [PubMed] [CrossRef]

6. Goodfellow IJ, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. arXiv:1412.6572. 2014. [Google Scholar]

7. Xie C, Wang J, Zhang Z, Zhou Y, Xie L, Yuille AL. Adversarial examples for semantic segmentation and object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. p. 1378–87. [Google Scholar]

8. Tramèr F, Kurakin A, Papernot N, Boneh D, McDaniel P. Ensemble adversarial training: attacks and defenses. arXiv:1705.07204. 2017. [Google Scholar]

9. Huang H, Chen ZY, Chen HR, Wang YT, Zhang K. T-SEA: transfer-based self-ensemble attack on object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 18–22; Vancouver, BC, Canada. p. 20514–23. [Google Scholar]

10. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. doi:10.1109/tpami.2016.2577031. [Google Scholar] [PubMed] [CrossRef]

11. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: single shot multibox detector. In: Proceedings of the European Conference on Computer Vision (ECCV); 2016 Oct 11–14; Amsterdam, The Netherlands. p. 21–37. [Google Scholar]

12. Jocher G, Chaurasia G, Stoken A, Wang Z, Kwon Y, Michael K, et al. YOLOv5. 2020. [cited 2026 Feb 15]. Available from: https://github.com/ultralytics/yolov5. [Google Scholar]

13. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: Proceedings of the European Conference on Computer Vision. (ECCV); 2020 Aug 23–28; Glasgow, UK. p. 213–29. [Google Scholar]

14. Zhang H, Li F, Liu S, Zhang L, Su H, Zhu J, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations (ICLR); 2023 May 1–5; Kigali, Rwanda. [Google Scholar]

15. Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. Adversarial examples are not bugs, they are features. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS); 2019 Dec 8–14; Vancouver, BC, Canada. p. 125–36. [Google Scholar]

16. Athalye A, Engstrom L, Ilyas A, Kwok K. Synthesizing robust adversarial examples. In: Proceedings of the International Conference on Machine Learning (ICML); 2018 Jul 10–15; Stockholm, Sweden. p. 284–93. [Google Scholar]

17. Li J, Wang X, Zhou L, Cheng K, Li J, Lyu S. Adversarial examples based on object detection tasks: a survey. Neurocomputing. 2023;519:114–26. doi:10.1016/j.neucom.2022.10.046. [Google Scholar] [CrossRef]

18. Nguyen K, Nguyen T, Zhang W, Lu K, Wu Y, Zheng X, et al. A survey and evaluation of adversarial attacks for object detection. arXiv:2408.01934. 2024. [Google Scholar]

19. Yuan X, He P, Zhu Q, Li X. Adversarial examples: attacks and defenses for deep learning. IEEE Trans Neural Netw Learn Syst. 2019;30(9):2805–24. [Google Scholar] [PubMed]

20. Demontis A, Melis M, Pintor M, Jagielski M, Biggio B, Oprea A, et al. Why do adversarial attacks transfer? Explaining transferability of evasion and poisoning attacks. In: Proceedings of the USENIX Security Symposium; 2019 Aug 14–16; Santa Clara, CA, USA. p. 321–38. [Google Scholar]

21. Tramèr F, Carlini N, Brendel W, Madry A. On adaptive attacks to adversarial example defenses. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); 2020 Dec 6–12; Virtual Conference. p. 1633–45. [Google Scholar]

22. Stutz D, Hein M, Schiele B. Disentangling adversarial robustness and generalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 16–20; Long Beach, CA, USA. p. 6976–87. [Google Scholar]

23. Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018 Jun 18–22; Salt Lake City, UT, USA. p. 6154–62. [Google Scholar]

24. Li Y, Tian D, Chang M, Bian X, Lyu S. Robust adversarial perturbation on deep proposal-based models. arXiv:1809.05962. 2018. [Google Scholar]

25. Lian J, Wang X, Su Y, Ma M, Mei S. Contextual adversarial attack against aerial detection in the physical world. arXiv:2302.13487. 2023. [Google Scholar]

26. Zhang H, Zhou W, Li H. Contextual adversarial attacks for object detection. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME); 2020 Jul 6–10; Virtual Conference. p. 1–6. [Google Scholar]

27. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. p. 2980–88. [Google Scholar]

28. Tian Z, Shen C, Chen H, He T. FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2019 Oct 27–Nov 2; Seoul, Republic of Korea. p. 9627–36. [Google Scholar]

29. Chow K, Liu L, Gursoy ME, Truex S, Wei W, Wu Y. TOG: targeted adversarial objectness gradient attacks on real-time object detection systems. arXiv:2004.04320. 2020. [Google Scholar]

30. Shi G, Lin Z, Peng A, Zeng H. An enhanced transferable adversarial attack against object detection. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN); 2023 Jun 18–23; Gold Coast, QLD, Australia. p. 1–7.

31. Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable DETR: deformable transformers for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations (ICLR); 2021 May 3–7; Virtual Conference.

32. Meng D, Chen X, Fan Z, Zeng G, Li H, Yuan Y, et al. Conditional DETR for fast training convergence. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2021 Oct 11–17; Montreal, QC, Canada. p. 3651–60.

33. Chen B, Yin J, Chen S, Chen B, Liu X. An adaptive model ensemble adversarial attack for boosting adversarial transferability. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2023 Oct 2–6; Paris, France. p. 4489–98.

34. Wang Z, Zhang Z, Wang Y, Liang S, Wang X. Feature importance-aware transferable adversarial attacks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2021 Oct 11–17; Montreal, QC, Canada. p. 15156–65.

35. Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards deep learning models resistant to adversarial attacks. In: Proceedings of the International Conference on Learning Representations (ICLR); 2018 Apr 30–May 3; Vancouver, BC, Canada.

36. Liu Y, Chen X, Liu C, Song D. Delving into transferable adversarial examples and black-box attacks. In: Proceedings of the International Conference on Learning Representations (ICLR); 2017 Apr 24–26; Toulon, France.

37. Dong Y, Liao F, Pang T, Su H, Zhu J, Hu X, et al. Boosting adversarial attacks with momentum. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2018 Jun 18–22; Salt Lake City, UT, USA. p. 9185–93.

38. Xiong Y, Lin J, Zhang M, Hopcroft JE, He K. Stochastic variance reduced ensemble adversarial attack for boosting the adversarial transferability. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 19–24; New Orleans, LA, USA. p. 14983–92.

39. Xie C, Zhang Z, Zhou Y, Bai S, Wang J, Ren Z, et al. Improving transferability of adversarial examples with input diversity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 16–20; Long Beach, CA, USA. p. 2730–9.

40. Dong Y, Pang T, Su H, Zhu J. Evading defenses to transferable adversarial examples by translation-invariant attacks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 16–20; Long Beach, CA, USA. p. 4312–21.

41. Lin J, Song C, He K, Wang L, Hopcroft JE. Nesterov accelerated gradient and scale invariance for adversarial attacks. In: Proceedings of the International Conference on Learning Representations (ICLR); 2020 Apr 26–30; Addis Ababa, Ethiopia.

42. Wang X, He K. Enhancing the transferability of adversarial attacks through variance tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 19–25; Virtual Conference. p. 1924–33.

43. Wang Y, Zheng Y, Chen L, Yang Z, Cao T. DIB-UAP: enhancing the transferability of universal adversarial perturbation via deep information bottleneck. Complex Intell Syst. 2024;10:6825–37.

44. Xue W, Xia X, Wan P, Zhu W, Peng B, He C, et al. Adversarial attack on object detection via object feature-wise attention and perturbation extraction. Tsinghua Sci Technol. 2024;30(3):1174–89. doi:10.26599/tst.2024.9010029.

45. Chen H, Zhang Y, Dong Y, Yang X, Su H, Zhu J. Rethinking model ensemble in transfer-based adversarial attacks. In: Proceedings of the International Conference on Learning Representations (ICLR); 2024 May 7–11; Vienna, Austria. doi:10.1145/3718751.3718815.

46. Zhang J, Huang J, Wang W, Li Y, Wu W, Wang X, et al. Improving the transferability of adversarial samples by path-augmented method. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 18–22; Vancouver, BC, Canada. p. 8173–82.

47. Zhang H, Yu Y, Jiao J, Xing E, El Ghaoui L, Jordan M. Theoretically principled trade-off between robustness and accuracy. In: Proceedings of the International Conference on Machine Learning (ICML); 2019 Jun 9–15; Long Beach, CA, USA. p. 7472–82.

48. Rice L, Wong E, Kolter Z. Overfitting in adversarially robust deep learning. In: Proceedings of the International Conference on Machine Learning (ICML); 2020 Jul 13–18; Virtual Conference. p. 8093–104.

49. Wang H, Xiao C, Kossaifi J, Yu Z, Anandkumar A, Wang Z. AugMax: adversarial composition of random augmentations for robust training. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); 2021 Dec 6–14; Virtual Conference. p. 237–50.

50. Hendrycks D, Mu N, Cubuk ED, Zoph B, Gilmer J, Lakshminarayanan B. AugMix: a simple data processing method to improve robustness and uncertainty. In: Proceedings of the International Conference on Learning Representations (ICLR); 2020 Apr 26–30; Addis Ababa, Ethiopia.

51. Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV. AutoAugment: learning augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 16–20; Long Beach, CA, USA. p. 113–23.

52. Cubuk ED, Zoph B, Shlens J, Le QV. RandAugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2020 Jun 14–19; Seattle, WA, USA. p. 702–3.

53. Xie C, Tan M, Gong B, Wang J, Yuille AL, Le QV. Adversarial examples improve image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 14–19; Seattle, WA, USA. p. 819–28.

54. Ding X, Chen J, Yu H, Shang Y, Ma H. Enhancing adversarial transferability in object detection with bidirectional feature distortion. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2024 Apr 14–19; Seoul, Republic of Korea. p. 5525–9.

55. Shao R, Shi Z, Yi J, Chen PY, Hsieh CJ. On the adversarial robustness of vision transformers. arXiv:2103.15670. 2022. [cited 2026 Feb 15]. Available from: https://arxiv.org/abs/2103.15670.

56. Fu Y, Zhang S, Wu S, Wan C, Lin Y. Patch-Fool: are vision transformers always robust against adversarial perturbations? In: Proceedings of the International Conference on Learning Representations (ICLR); 2022 Apr 25–29; Virtual Conference.

57. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A. The PASCAL visual object classes challenge 2012 (VOC2012) results. [cited 2026 Feb 15]. Available from: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

58. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV); 2014 Sep 6–12; Zurich, Switzerland. p. 740–55.

59. Bai Y, Mei J, Yuille AL, Xie C. Are transformers more robust than CNNs? In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); 2021 Dec 6–14; Virtual Conference.

60. Mao X, Qi G, Chen Y, Li X, Duan R, Ye Z, et al. Towards robust vision transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 19–24; New Orleans, LA, USA. p. 12042–51.

61. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. p. 618–27.

62. Liang S, Wei X, Yao S, Cao X. Efficient adversarial attacks for visual object tracking. In: Proceedings of the European Conference on Computer Vision (ECCV); 2020 Aug 23–28; Glasgow, UK. p. 35–51.

Cite This Article

APA Style

Pandey, S.K., Mi, J., Ullah, Z., Jamjoom, M. (2026). AugTrans: Boosting Adversarial Transferability in Object Detection with a Dynamic, Object-Aware Augmentation Pipeline. Computers, Materials & Continua, 87(3), 98. https://doi.org/10.32604/cmc.2026.074811

Vancouver Style

Pandey SK, Mi J, Ullah Z, Jamjoom M. AugTrans: Boosting Adversarial Transferability in Object Detection with a Dynamic, Object-Aware Augmentation Pipeline. Comput Mater Contin. 2026;87(3):98. https://doi.org/10.32604/cmc.2026.074811

IEEE Style

S. K. Pandey, J. Mi, Z. Ullah, and M. Jamjoom, “AugTrans: Boosting Adversarial Transferability in Object Detection with a Dynamic, Object-Aware Augmentation Pipeline,” Comput. Mater. Contin., vol. 87, no. 3, pp. 98, 2026. https://doi.org/10.32604/cmc.2026.074811


Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
