An adversarial attack is a deliberate attempt to cause a machine learning model to make a mistake by feeding it a specially crafted input, known as an adversarial example. These attacks are a cornerstone of adversarial testing, a systematic evaluation method that exposes model weaknesses. Attacks are broadly categorized by an adversary's knowledge: white-box attacks assume full model access, while black-box attacks rely only on input-output queries.
What is an Adversarial Attack?
A core concept in AI security and evaluation, defining a critical method for probing model vulnerabilities.
The primary goal is to assess and improve adversarial robustness. Common techniques include the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). Defenses like adversarial training incorporate these attacks during model development. This practice is integral to preemptive algorithmic cybersecurity: it helps make models more reliable against manipulation in production environments.
Core Characteristics of Adversarial Attacks
Adversarial attacks exploit the sensitivity of machine learning models to small, often imperceptible, input perturbations. These systematic methods are defined by their objectives, constraints, and the attacker's knowledge.
Attack Objective: Targeted vs. Untargeted
The fundamental goal of an attack defines its objective.
- Untargeted Attack: The adversary's goal is to cause any misclassification. The model's output must simply differ from the true label. This is often easier to achieve and measures a model's general vulnerability.
- Targeted Attack: The adversary aims to force the model to predict a specific, incorrect class. This requires more precise control over the perturbation and is a stronger test of a model's robustness, simulating scenarios where an attacker wants a system to behave in a predetermined, malicious way.
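The two objectives differ only in their success criterion, which can be sketched as two small predicates. The function names and labels below are hypothetical and chosen purely for illustration.

```python
def untargeted_success(predicted_label, true_label):
    """Untargeted: any prediction other than the true label counts."""
    return predicted_label != true_label


def targeted_success(predicted_label, target_label):
    """Targeted: only the attacker-chosen class counts."""
    return predicted_label == target_label


# Hypothetical labels for illustration.
true_label, target_label = "stop_sign", "speed_limit"
prediction = "yield"

print(untargeted_success(prediction, true_label))  # True
print(targeted_success(prediction, target_label))  # False
```

The same misprediction counts as a success for the untargeted attacker and a failure for the targeted one, which is why targeted attacks are the more demanding test.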
Perturbation Constraint: Imperceptibility
A defining feature of adversarial examples is that the perturbation is small, often measured by an L_p norm.
- L∞ Norm (Max Perturbation): Bounds the maximum change allowed to any single feature (e.g., pixel). The Fast Gradient Sign Method (FGSM) uses this constraint. Perturbations are often subtle but spread across the entire input.
- L2 Norm (Euclidean Distance): Bounds the overall magnitude of the change. Attacks like Carlini & Wagner (C&W) optimize under this constraint, often resulting in more diffuse, less locally intense noise.
- L0 Norm (Sparsity): Counts the number of altered features. The One-Pixel Attack is an extreme example, changing minimal features to maximize impact, testing a model's sensitivity to sparse anomalies.
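As a rough illustration of how the three norms rank the same perturbations differently, the sketch below computes them for a sparse and a dense perturbation of a toy four-feature input. The helper name `perturbation_norms` and the input values are assumptions made for this example; plain Python lists are used to keep it self-contained.

```python
import math


def perturbation_norms(x, x_adv, tol=1e-12):
    """Return the L-infinity, L2, and L0 size of the perturbation x_adv - x."""
    delta = [a - b for a, b in zip(x_adv, x)]
    return {
        "linf": max(abs(d) for d in delta),           # largest single change
        "l2": math.sqrt(sum(d * d for d in delta)),   # overall magnitude
        "l0": sum(1 for d in delta if abs(d) > tol),  # number of changed features
    }


x = [0.0, 0.5, 1.0, 0.25]
sparse = [0.0, 0.5, 0.5, 0.25]        # one feature changed by 0.5
dense = [0.125, 0.625, 0.875, 0.375]  # every feature changed by 0.125

print(perturbation_norms(x, sparse))  # {'linf': 0.5, 'l2': 0.5, 'l0': 1}
print(perturbation_norms(x, dense))   # {'linf': 0.125, 'l2': 0.25, 'l0': 4}
```

The sparse perturbation is small under L0 but large under L∞, while the dense one is the reverse, so the choice of norm determines which kind of attack a given budget permits.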
Attacker Knowledge: White-Box vs. Black-Box
The attacker's assumed access to the target model dictates the attack strategy.
- White-Box Attack: The attacker has full knowledge of the model, including its architecture, parameters (weights), and training data. This allows direct computation of gradients to craft highly effective perturbations. FGSM, PGD, and C&W are classic white-box methods used for robustness benchmarking.
- Black-Box Attack: The attacker has no internal knowledge, treating the model as an oracle that returns only outputs (e.g., labels or confidence scores). Attacks rely on query-based strategies or transferability. This scenario is more realistic for attacking deployed APIs. An adversary might:
  - Use a surrogate model to craft examples that transfer.
  - Perform gradient estimation via finite differences.
  - Use evolutionary algorithms.
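Gradient estimation via finite differences can be sketched as follows. The black-box model here is a stand-in: `query_confidence` and its hidden weights `HIDDEN_W` and `HIDDEN_B` are hypothetical, and the estimator only ever calls the query function, never the weights.

```python
import math

# Hypothetical hidden model; the attacker cannot see these values.
HIDDEN_W, HIDDEN_B = [2.0, -1.0, 0.5], 0.1


def query_confidence(x):
    """Black-box oracle: returns the model's confidence for one class."""
    z = sum(w * xi for w, xi in zip(HIDDEN_W, x)) + HIDDEN_B
    return 1.0 / (1.0 + math.exp(-z))


def estimate_gradient(query_fn, x, h=1e-4):
    """Central finite differences: two queries per input feature."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((query_fn(x_plus) - query_fn(x_minus)) / (2 * h))
    return grad


x = [0.2, 0.4, 0.6]
print(estimate_gradient(query_confidence, x))
```

The cost is two queries per feature per gradient estimate, which is why query efficiency dominates the design of practical black-box attacks on high-dimensional inputs.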
Attack Sphere: Digital vs. Physical
Attacks are executed in different domains, imposing unique constraints.
- Digital Attacks: Perturbations are applied directly to the digital input (e.g., image pixels, text tokens). This allows for precise, fine-grained manipulation of individual features. Most research and initial vulnerability discovery happens in this controlled domain.
- Physical Attacks: Perturbations must survive real-world transformations like varying viewpoints, lighting, and camera noise. These often take the form of patch attacks (e.g., a sticker on a stop sign) or texture changes. They require the perturbation to be robust to affine transformations and are critical for evaluating systems like autonomous vehicles or facial recognition in the wild.
Temporal Phase: Evasion vs. Poisoning
Attacks occur at different stages of the machine learning lifecycle.
- Evasion Attack (Inference-Time): This is the classic adversarial attack. A malicious input is crafted to fool an already-trained model during deployment. The model's parameters are fixed. Defenses include adversarial training and input sanitization.
- Poisoning Attack (Training-Time): The adversary corrupts the model's training process. By injecting malicious data into the training set, they create a backdoor or degrade overall performance. The compromised model then behaves incorrectly on specific triggered inputs or general data. This attacks the integrity of the learning pipeline itself and is harder to detect post-deployment.
The Transferability Property
Transferability is a critical and concerning property: an adversarial example crafted for one model is often effective against a different, unknown model.
- Mechanism: Transferability arises because adversarial perturbations often exploit non-robust features—patterns in the data that are predictive but not semantically meaningful to humans—that are learned by many models trained on similar data.
- Implication: This enables practical black-box attacks. An attacker can train their own surrogate model, generate white-box attacks against it, and have a high probability of fooling the target black-box model. It also means that robustness cannot rely on security through obscurity (e.g., hiding the model architecture).
- Defensive Challenge: Techniques like gradient masking may stop white-box attacks on a specific model but often fail against transfer attacks from a different surrogate, providing a false sense of security.
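A toy sketch of a transfer attack, under strong simplifying assumptions: both models are small logistic regressions with hand-picked, similar (not identical) weights, standing in for two models trained on similar data. All names and numbers are hypothetical. The perturbation is crafted with a single gradient-sign step using only the surrogate, then evaluated on the target.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def craft_on_surrogate(surrogate, x, y, eps):
    """Gradient-sign step using only the surrogate's weights.

    For logistic regression with cross-entropy loss, the input gradient
    is (p - y) * w.
    """
    w, _ = surrogate
    p = predict_proba(surrogate, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


surrogate = ([1.5, -2.5, 1.2], 0.1)  # attacker's own model (assumed)
target = ([2.0, -3.0, 1.0], 0.0)     # unseen deployed model (assumed)

x, y = [0.6, 0.2, 0.5], 1
x_adv = craft_on_surrogate(surrogate, x, y, eps=0.25)

print(predict_proba(target, x) > 0.5)      # clean input: target predicts class 1
print(predict_proba(target, x_adv) > 0.5)  # transferred example: class 0
```

The example transfers here because the two weight vectors agree in sign, a crude proxy for two models having learned the same non-robust features.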
Types of Adversarial Attacks
A systematic comparison of adversarial attack methodologies, categorized by the attacker's knowledge, the attack's objective, and its execution domain.
| Attack Feature / Dimension | White-Box Attack | Black-Box Attack | Physical-World Attack |
|---|---|---|---|
| Attacker Knowledge | Full access to model architecture, parameters, and gradients. | Access limited to model inputs and outputs (API queries). | No direct model access; attacks real-world sensor inputs. |
| Primary Objective | Craft minimal, often imperceptible, perturbations to cause misclassification. | Infer decision boundaries via queries to craft effective inputs. | Create robust, visible alterations that survive camera capture and environmental variance. |
| Perturbation Constraint | Minimal L_p norm (e.g., L_∞ < 0.03). | Query budget and success rate are primary constraints. | Must be physically realizable, robust to viewpoint/lighting changes. |
| Example Methods | Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), Carlini & Wagner. | Query-based optimization, transfer attacks from surrogate models. | Patch attacks, adversarial stickers, object camouflage. |
| Defense Difficulty | High; defenses often circumvented by adaptive attacks. | Moderate; rate-limiting and input filtering can be effective. | Very High; requires hardening of entire perception pipeline. |
| Evaluation Metric | Adversarial success rate, perturbation magnitude (epsilon). | Number of queries to success, attack transferability rate. | Attack success rate under varying real-world conditions (angle, distance). |
| Common Target Models | All differentiable models (CNNs, Transformers) in development. | Production APIs, commercial ML services. | Autonomous vehicle perception, facial recognition, robotic vision. |
| Attack Stage | Primarily inference-time (evasion). | Inference-time (evasion) or training-time (poisoning via queries). | Inference-time on physical systems. |
Common Adversarial Attack Methods
Adversarial attacks are systematic methods for crafting inputs that exploit model vulnerabilities. This section details the primary techniques used to generate adversarial examples, categorized by their underlying mechanism and threat model.
Fast Gradient Sign Method (FGSM)
The Fast Gradient Sign Method is a single-step, gradient-based white-box attack. It generates an adversarial example by perturbing the original input in the direction of the gradient of the loss function with respect to the input, scaled by a small epsilon (ε).
- Mechanism: x_adv = x + ε * sign(∇_x J(θ, x, y))
- Key Trait: Computationally efficient, often used as a baseline for robustness evaluation.
- Limitation: As a single-step method, it can be less potent than iterative attacks but is highly effective against undefended models.
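A minimal sketch of the FGSM update on a toy model, assuming a logistic-regression classifier with cross-entropy loss so that the input gradient has the closed form (p - y) * w. The weights, input, and ε are made-up values. A real attack on a neural network would obtain the gradient from a framework's automatic differentiation instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss J with respect to the input x."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """x_adv = x + eps * sign(grad_x J): one step, no iteration."""
    grad = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


w, b = [2.0, -3.0, 1.0], 0.0  # hypothetical trained weights
x, y = [0.6, 0.2, 0.5], 1     # clean input with true label 1

x_adv = fgsm(w, b, x, y, eps=0.25)
print(predict_proba(w, b, x))      # above 0.5: correctly classified
print(predict_proba(w, b, x_adv))  # below 0.5: misclassified
```

Every feature moves by exactly ε, so the perturbation sits on the boundary of the L∞ ball. For image inputs the result is usually also clipped to the valid pixel range, which is omitted here.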
Projected Gradient Descent (PGD)
Projected Gradient Descent is a strong, iterative white-box attack considered a cornerstone both for evaluating robustness and for adversarial training. It applies an FGSM-style step multiple times with a small step size (α), projecting the perturbed example back into the ε-bounded norm ball around the original input after each iteration.
- Mechanism: Iterative application: x_{t+1} = Proj_{x+ε} (x_t + α * sign(∇_x J(θ, x_t, y)))
- Key Trait: Effectively explores the loss landscape, finding strong adversarial examples within the constraint.
- Use Case: The standard benchmark for measuring adversarial robustness and the primary attack used in adversarial training frameworks.
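The iteration can be sketched on a toy logistic-regression model, assuming an L∞ constraint and inputs in [0, 1]. The weights, step size, and step count are illustrative. The random starting point that many PGD implementations use is left out to keep the sketch deterministic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def grad_wrt_input(w, b, x, y):
    """Closed-form input gradient of cross-entropy for logistic regression."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def pgd_linf(w, b, x, y, eps, alpha, steps):
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_wrt_input(w, b, x_adv, y)
        # Gradient-sign ascent step of size alpha.
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Projection: clamp each feature back into [x - eps, x + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        # Keep the result a valid input.
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


w, b = [2.0, -3.0, 1.0], 0.0  # hypothetical trained weights
x, y = [0.6, 0.2, 0.5], 1

x_adv = pgd_linf(w, b, x, y, eps=0.25, alpha=0.1, steps=5)
print(max(abs(a - c) for a, c in zip(x_adv, x)))  # stays within eps
print(predict_proba(w, b, x_adv) < 0.5)           # misclassified
```

On a linear model the gradient sign never changes, so PGD ends at the same corner of the ε-ball that a single FGSM step reaches. The benefit of iterating appears on non-linear models, where the gradient direction shifts as the input moves.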
Carlini & Wagner (C&W) Attack
The Carlini & Wagner attack is an optimization-based white-box attack formulated to find the minimal adversarial perturbation under a chosen norm constraint (e.g., L2, L∞). It directly optimizes a custom loss function designed to balance perturbation size with misclassification confidence.
- Key Trait: Generates highly imperceptible adversarial examples, often used to break defensive techniques like gradient masking and defensive distillation.
- Method: Solves: minimize ||δ||_p + c * f(x+δ) subject to x+δ ∈ [0,1]^n.
- Challenge: More computationally intensive than gradient-sign methods but is a powerful tool for stress-testing defenses.
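To make the term f(x+δ) concrete, the sketch below evaluates one commonly used choice: a targeted margin over the model's logits with a confidence parameter κ, combined with a squared L2 penalty as in the L2 variant of the attack. The function names and logit values are assumptions for illustration. The optimizer, the box-constraint handling, and the search over the constant c are omitted.

```python
def cw_margin(logits, target, kappa=0.0):
    """Targeted margin: max(max_{i != t} Z_i - Z_t, -kappa).

    Positive while another class still beats the target; it bottoms out
    at -kappa once the target leads by at least kappa.
    """
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


def cw_objective(delta, logits_adv, target, c, kappa=0.0):
    """Squared L2 size of delta plus c times the margin at x + delta."""
    l2_squared = sum(d * d for d in delta)
    return l2_squared + c * cw_margin(logits_adv, target, kappa)


logits = [1.0, 3.0, 2.0]  # hypothetical logits Z(x + delta)

print(cw_margin(logits, target=0))             # 2.0: target not yet reached
print(cw_margin(logits, target=1, kappa=0.5))  # -0.5: target leads with margin
print(cw_objective([0.5, 0.0], logits, target=0, c=2.0))  # 0.25 + 2 * 2.0 = 4.25
```

The constant c trades perturbation size against attack success, and in practice it is usually tuned by a search such as bisection rather than fixed in advance.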
DeepFool
DeepFool is an efficient, iterative white-box attack that approximates the classifier's decision boundary as a linear hyperplane at each step. It computes the minimal perturbation required to cross this linearized boundary, iterating until the input is misclassified.
- Key Trait: Designed to find perturbations with smaller norms than one-step attacks like FGSM, providing a better estimate of a model's robustness margin.
- Mechanism: Iteratively linearizes the classifier and pushes the point orthogonally towards the nearest decision boundary.
- Output: Often produces adversarial examples with very low perceived distortion.
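For a binary classifier that is exactly affine, f(x) = w·x + b, the linearization is exact and a single DeepFool step has a closed form: the orthogonal projection onto the hyperplane, scaled by a small overshoot so the point actually crosses it. The sketch below shows only that special case with made-up weights. The full method repeats the step on a fresh linearization of a non-linear, multi-class model.

```python
import math


def linear_score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def deepfool_linear(w, b, x, overshoot=0.02):
    """Minimal L2 step across the boundary of f(x) = w.x + b.

    r = -f(x) / ||w||^2 * w is the projection onto the hyperplane;
    the overshoot factor pushes the point slightly past it.
    """
    f = linear_score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    r = [-(f / norm_sq) * wi for wi in w]
    return [xi + (1 + overshoot) * ri for xi, ri in zip(x, r)]


w, b = [3.0, 4.0], -1.0  # hypothetical classifier, ||w|| = 5
x = [1.0, 1.0]           # f(x) = 6, so the distance to the boundary is 6 / 5

x_adv = deepfool_linear(w, b, x)
dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x_adv, x)))

print(linear_score(w, b, x) > 0, linear_score(w, b, x_adv) < 0)  # True True
print(dist)  # about 1.224, i.e. 1.02 * |f(x)| / ||w||
```

The perturbation length |f(x)| / ||w|| is the input's distance to the decision boundary, which is why DeepFool is often read as an estimate of the robustness margin.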
Universal Adversarial Perturbation
A Universal Adversarial Perturbation is a single, input-agnostic noise vector that, when added to most natural images from a data distribution, causes a model to misclassify them with high probability. This reveals systemic, data-independent vulnerabilities in the model's decision function.
- Key Trait: Demonstrates that vulnerabilities are not unique to specific inputs but are embedded in the model's geometry.
- Generation: Typically crafted by aggregating perturbations required to fool the model on a large set of training images.
- Implication: Raises significant security concerns for deployed systems, as a single patch or filter could cause widespread failures.
Query-Based Black-Box Attacks
Query-based attacks are a class of black-box attacks where the adversary has no internal model knowledge, only the ability to query the target model and receive outputs (e.g., predicted class or confidence scores). Strategies include:
- Score-based: Uses estimated gradients from finite differences or natural evolution strategies.
- Decision-based: Relies only on the final predicted class label, using boundary search methods like the Boundary Attack.
- Key Challenge: Must be highly query-efficient to remain practical against rate-limited APIs.
- Real-world Relevance: Most closely models the threat from external attackers probing a deployed model service.
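One building block of decision-based attacks can be sketched with label-only access: starting from any input the model already misclassifies, a binary search along the line back toward the original input finds a misclassified point close to the decision boundary. This is not the full Boundary Attack, only the line-search step, and the oracle `label_fn` is a hypothetical stand-in for a deployed model.

```python
def label_fn(x):
    """Hypothetical black-box model that returns only a class label."""
    return 1 if x[0] + x[1] > 1.0 else 0


def boundary_binary_search(label_fn, x, x_start, true_label, tol=1e-3):
    """Shrink the step from x toward x_start while staying misclassified."""
    if label_fn(x_start) == true_label:
        raise ValueError("x_start must already be misclassified")
    queries = 1
    lo, hi = 0.0, 1.0  # fraction of the way from x to x_start
    while hi - lo > tol:
        mid = (lo + hi) / 2
        candidate = [xi + mid * (si - xi) for xi, si in zip(x, x_start)]
        queries += 1
        if label_fn(candidate) != true_label:
            hi = mid  # still adversarial: move closer to x
        else:
            lo = mid  # crossed back: move away from x
    x_adv = [xi + hi * (si - xi) for xi, si in zip(x, x_start)]
    return x_adv, queries


x = [0.9, 0.9]        # correctly classified as class 1
x_start = [0.0, 0.0]  # any input the model labels differently

x_adv, queries = boundary_binary_search(label_fn, x, x_start, true_label=1)
print(label_fn(x_adv), queries)
```

Each halving costs one query, so the query count grows with the logarithm of the required precision. Full decision-based attacks follow such a search with many further queries to move along the boundary toward the original input.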
Frequently Asked Questions
A glossary of key terms and concepts related to adversarial attacks, which are systematic attempts to cause machine learning models to fail through carefully crafted inputs.
An adversarial attack is a deliberate attempt to cause a machine learning model to make a mistake by feeding it a specially crafted input, known as an adversarial example. These inputs are created by applying small, often imperceptible, perturbations to a normal input (like an image or text) that exploit the model's learned decision boundaries. The goal is to manipulate the model's output—such as causing an image classifier to mislabel a panda as a gibbon—while the input appears unchanged to a human observer. This exposes fundamental vulnerabilities in how models generalize from their training data.
Related Terms
To fully understand adversarial attacks, it is essential to grasp the ecosystem of related concepts, including attack methodologies, defensive strategies, and evaluation metrics that define the field of machine learning security.
Adversarial Example
An adversarial example is the specific, malicious input crafted during an adversarial attack. It is a data point (e.g., an image, text sample, or audio clip) that has been perturbed in a way often imperceptible to humans but causes a machine learning model to produce an incorrect output with high confidence. The perturbation is typically constrained by a norm (like L∞) to ensure the input appears legitimate.
- Key Characteristic: The perturbation is optimized to be minimal yet effective.
- Example: A panda image altered by adding a subtle, noise-like pattern that causes an image classifier to predict it as a gibbon.
Adversarial Robustness
Adversarial robustness is the property of a machine learning model that quantifies its resilience to adversarial examples. It is measured by a model's robust accuracy—its classification accuracy on a test set containing adversarial inputs. A model with high robustness maintains correct predictions even under attack.
- Contrast with Standard Accuracy: A model can have high standard accuracy on clean data but low robust accuracy.
- Primary Goal of Defenses: Techniques like adversarial training aim to maximize this property.
- Evaluation: Typically assessed using benchmark attacks like Projected Gradient Descent (PGD).
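The gap between standard and robust accuracy can be sketched with a toy one-dimensional threshold classifier and a worst-case shift of size ε. The helper names, the data, and the attack are all illustrative assumptions. In a real evaluation `attack_fn` would be a benchmark attack such as PGD.

```python
def accuracy(predict_fn, inputs, labels):
    correct = sum(1 for x, y in zip(inputs, labels) if predict_fn(x) == y)
    return correct / len(labels)


def robust_accuracy(predict_fn, attack_fn, inputs, labels):
    """Accuracy after every input has been replaced by its adversarial version."""
    adversarial = [attack_fn(x, y) for x, y in zip(inputs, labels)]
    return accuracy(predict_fn, adversarial, labels)


def predict(x):
    """Toy classifier: a threshold at 0.5 on a single feature."""
    return 1 if x[0] > 0.5 else 0


def shift_attack(x, y, eps=0.1):
    """Worst case for this classifier: move eps toward the threshold."""
    return [x[0] - eps] if y == 1 else [x[0] + eps]


inputs = [[0.9], [0.55], [0.1], [0.45]]
labels = [1, 1, 0, 0]

print(accuracy(predict, inputs, labels))                      # 1.0
print(robust_accuracy(predict, shift_attack, inputs, labels)) # 0.5
```

The two inputs that sit within ε of the threshold are classified correctly when clean but flip under attack, so robust accuracy is only as meaningful as the attack used to measure it.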
White-Box vs. Black-Box Attack
These terms classify attacks based on the attacker's knowledge of the target model.
- White-Box Attack: The attacker has full access to the model's architecture, parameters (weights), and gradients. This allows for precise, gradient-based optimization of adversarial examples (e.g., FGSM, PGD, C&W). It represents a worst-case security assessment.
- Black-Box Attack: The attacker has no internal model knowledge, interacting only via the model's input-output API. Attacks are typically query-based, using the model's predictions to estimate gradients or perform evolutionary searches. Transfer attacks, where an example crafted on a surrogate model fools the target, are a common black-box strategy.
Evasion Attack vs. Poisoning Attack
These are the two primary temporal categories of adversarial attacks, distinguished by when they occur in the model lifecycle.
- Evasion Attack (Inference-Time Attack): The attack occurs after the model is deployed. The adversary crafts a malicious input to 'evade' correct classification during inference. This is the most common type, encompassing most image and text adversarial examples.
- Poisoning Attack (Training-Time Attack): The attack occurs during the model's training phase. The adversary injects corrupted or malicious data into the training set to compromise the learned model. A backdoor attack is a subtype where the model behaves normally on clean data but misbehaves when a specific trigger pattern is present.
Targeted vs. Untargeted Attack
This distinction defines the adversary's objective regarding the model's incorrect output.
- Untargeted Attack: The goal is to cause any misclassification. The attack is successful if the model's output is incorrect, regardless of what the new class is. This is generally easier to execute.
- Targeted Attack: The goal is to cause the model to predict a specific, attacker-chosen incorrect class. For example, causing a stop sign to be classified as a speed limit sign. This is a more constrained and typically more difficult objective, often requiring larger perturbations or more optimization effort.
Adversarial Training
Adversarial training is the foremost defensive technique for improving model robustness. It involves augmenting the training dataset with adversarial examples generated on-the-fly during the training process. The model learns to classify both clean and adversarially perturbed examples correctly.
- Process: In each training iteration, adversarial examples are crafted for the current mini-batch (often using PGD), and the model is updated to minimize loss on these hard examples.
- Trade-off: Often improves robust accuracy at some cost to standard accuracy on clean data.
- Foundation: Considered a baseline and essential component of modern robust model development.
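The training loop described above can be sketched on a toy problem. Several simplifications are assumed here: the model is logistic regression trained by full-batch gradient descent, the inner attack is single-step FGSM rather than PGD, and each update uses only the adversarial versions of the examples. The data, ε, learning rate, and epoch count are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, eps):
    """One gradient-sign step; the input gradient is (p - y) * w."""
    p = predict_proba(w, b, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


def adversarial_train(data, eps, lr=1.0, epochs=200):
    n_features = len(data[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in data:
            # Craft the attack against the current weights, then train on it.
            x_adv = fgsm(w, b, x, y, eps)
            err = predict_proba(w, b, x_adv) - y
            for i, xi in enumerate(x_adv):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def robust_accuracy(w, b, data, eps):
    hits = 0
    for x, y in data:
        x_adv = fgsm(w, b, x, y, eps)
        hits += int((predict_proba(w, b, x_adv) > 0.5) == bool(y))
    return hits / len(data)


toy_data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.15, 0.15], 0),
            ([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.85, 0.85], 1)]

w, b = adversarial_train(toy_data, eps=0.1)
print(robust_accuracy(w, b, toy_data, eps=0.1))
```

Because the adversarial examples are regenerated from the current weights on every pass, the model is always trained against the attack that is strongest for it at that moment, which is the on-the-fly property the definition refers to.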

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.