An adversarial example is an input to a machine learning model that has been subtly perturbed to cause the model to output an incorrect prediction with high confidence. These perturbations are often imperceptible to humans but exploit the model's learned decision boundaries, revealing a fundamental brittleness in how neural networks generalize. This phenomenon is a core focus of adversarial robustness research and is central to red-teaming and security evaluations.

What is an Adversarial Example?
Adversarial examples are a central concept in machine learning security: deliberately crafted inputs designed to exploit model vulnerabilities.
Adversarial examples are generated through specific adversarial attack algorithms, such as the Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD), which optimize perturbations to maximize model error. They are categorized by attacker knowledge as white-box or black-box attacks and by goal as targeted or untargeted attacks. Defending against them often involves adversarial training, where such examples are included in the training process to improve model resilience.
Key Characteristics of Adversarial Examples
Adversarial examples are not random noise; they are inputs crafted with specific, measurable properties to exploit the geometric and statistical vulnerabilities of machine learning models. Understanding these core characteristics is essential for developing robust defenses.
Imperceptible Perturbation
The defining feature of an adversarial example is a minimal perturbation—often imperceptible to a human observer—that causes a model to fail. This is quantified using distance metrics like L-infinity norm (max pixel change) or L2 norm (Euclidean distance). For instance, changing pixel values by less than 8/255 on an image normalized to [0,1] can reliably flip a classifier's prediction. This characteristic demonstrates that model decision boundaries are highly sensitive in directions irrelevant to human perception.
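The two distance metrics mentioned above can be computed directly. The following is a minimal sketch, assuming an image stored as a NumPy array scaled to [0, 1]; the helper names `perturbation_norms` and `within_linf_budget` and the random stand-in image are illustrative assumptions, not part of any library.

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Return (L-infinity, L2) norms of the perturbation x_adv - x."""
    delta = (x_adv - x).ravel()
    return float(np.max(np.abs(delta))), float(np.linalg.norm(delta))

def within_linf_budget(x, x_adv, eps=8 / 255):
    """True if no pixel moved by more than eps (small tolerance for float error)."""
    linf, _ = perturbation_norms(x, x_adv)
    return linf <= eps + 1e-12

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(3, 32, 32))                # stand-in image in [0, 1]
delta = rng.choice([-1.0, 1.0], size=x.shape) * (8 / 255)  # +/- 8/255 per pixel
x_adv = np.clip(x + delta, 0.0, 1.0)                       # keep pixel values valid

linf, l2 = perturbation_norms(x, x_adv)
print(f"L-inf = {linf:.4f}, L2 = {l2:.4f}, within budget: {within_linf_budget(x, x_adv)}")
```

The L-infinity norm bounds the largest change to any single pixel, while the L2 norm grows with the number of pixels changed, so the same perturbation can look small under one metric and large under the other.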
High-Confidence Misclassification
Adversarial examples cause models to make incorrect predictions with unjustifiably high confidence. In the well-known example from Goodfellow et al., a tiny perturbation caused a classifier to label a panda image as a "gibbon" with 99.3% confidence. This highlights a critical failure mode: the model's internal confidence calibration is broken. The output probability distribution becomes sharply peaked on the wrong class, indicating the perturbation has moved the input far across the model's learned decision boundary in its internal representation space.
Transferability
A single adversarial example crafted for one model often transfers to other models, even those with different architectures or trained on different datasets. This occurs because different models learn similar decision boundaries for the same task. Transferability enables practical black-box attacks, where an attacker uses a surrogate model to generate examples effective against an unknown target model. It is a fundamental challenge for security, as patching one model does not guarantee ecosystem-wide safety.
Non-Random, Optimized Structure
Adversarial perturbations are not random noise. They are strategically constructed, often via optimization algorithms that exploit model gradients:
- Fast Gradient Sign Method (FGSM): Perturbs input along the sign of the loss gradient.
- Projected Gradient Descent (PGD): An iterative, stronger variant of FGSM.
- Carlini & Wagner (C&W): Solves an optimization problem for minimal perturbation.

These methods create perturbations aligned with the model's loss landscape, pushing the input across a decision boundary along the path of least resistance.
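The difference between random noise and an optimized perturbation can be illustrated on a toy model. This is a sketch under stated assumptions: the "model" is a hypothetical linear score `w @ x` with random weights, and both perturbations use the same L-infinity budget `eps`. It is not an attack on a real network.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                          # input dimensionality
w = rng.normal(size=n)            # hypothetical linear model: score(x) = w @ x
eps = 0.01                        # L-infinity budget shared by both perturbations

# Random perturbation: every coordinate moves by +/- eps in an arbitrary direction.
delta_random = eps * rng.choice([-1.0, 1.0], size=n)

# Gradient-aligned perturbation: the gradient of the score w.r.t. x is w,
# so stepping against sign(w) lowers the score as fast as the budget allows.
delta_aligned = -eps * np.sign(w)

shift_random = float(w @ delta_random)
shift_aligned = float(w @ delta_aligned)   # equals -eps * ||w||_1

print(f"score shift from random noise:    {shift_random:+.3f}")
print(f"score shift from aligned pattern: {shift_aligned:+.3f}")
```

Both perturbations change every coordinate by exactly `eps`, yet the random one largely cancels itself out, while the aligned one accumulates across all coordinates.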
Task and Modality Agnosticism
While most famous in image classification, adversarial examples have been demonstrated across many modalities and tasks:
- Text: Changing a few characters or synonyms can flip sentiment analysis.
- Audio: Inaudible perturbations can cause speech-to-text transcribers to output malicious commands.
- Reinforcement Learning: Perturbations to sensor inputs can cause autonomous agents to crash.
- Large Language Models: Adversarial suffixes or jailbreak prompts can bypass safety filters.

This universality indicates the vulnerability is inherent to high-dimensional, differentiable function approximators like neural networks.
Exploitation of Linear Regions
A primary cause of vulnerability, especially in high-dimensional spaces, is the piecewise linear nature of networks with ReLU activations. Adversarial examples exploit the fact that models are often locally linear. A small step in the direction of the gradient can cause a large change in output. This characteristic is central to the explanation offered by the "linearity hypothesis," which argues that the sheer dimensionality of inputs makes linear perturbations sufficient for causing misclassification, even in nominally non-linear models.
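The dimensionality argument can be made concrete with a short derivation for a single linear unit. The symbols here are assumptions introduced for illustration: a weight vector w, a perturbation η bounded by ε in the L-infinity norm, an input dimension n, and an average weight magnitude m.

```latex
\begin{align}
  \tilde{x} &= x + \eta, \qquad \|\eta\|_\infty \le \epsilon \\
  w^\top \tilde{x} &= w^\top x + w^\top \eta \\
  \eta = \epsilon \,\operatorname{sign}(w)
    \;&\Rightarrow\;
  w^\top \eta = \epsilon \,\|w\|_1 \approx \epsilon \, m \, n
\end{align}
```

The per-coordinate change stays fixed at ε, but the change in the activation grows linearly with n, so in high dimensions many individually negligible changes can add up to a large shift in the output.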
How Are Adversarial Examples Created?
Adversarial examples are created through systematic optimization techniques that apply minimal, often imperceptible, perturbations to a legitimate input to cause a machine learning model to produce an incorrect output.
Creation methods are broadly categorized by the attacker's knowledge. In a white-box attack, the adversary has full access to the model's architecture, parameters, and gradients. Techniques like the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) use the model's gradient to efficiently craft perturbations that maximize prediction error within a fixed perturbation budget. Other formulations, such as the Carlini & Wagner attack, instead minimize perturbation size subject to the input being misclassified.
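As an illustration of the white-box setting, here is a minimal PGD sketch against a logistic regression model whose input gradient is available in closed form. The model, its random parameters, and the helper names (`bce_loss`, `grad_wrt_input`, `pgd_attack`) are assumptions chosen to keep the example self-contained; a real attack would obtain gradients from a deep learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on one input (y in {0, 1})."""
    p = np.clip(sigmoid(w @ x + b), 1e-12, 1 - 1e-12)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def grad_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    return (sigmoid(w @ x + b) - y) * w

def pgd_attack(w, b, x, y, eps=0.05, step=0.01, n_steps=20):
    """Iteratively ascend the loss, projecting back into the L-inf ball around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        g = grad_wrt_input(w, b, x_adv, y)
        x_adv = x_adv + step * np.sign(g)          # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # stay in the valid input range
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.0                    # stand-in model parameters
x, y = rng.uniform(0.0, 1.0, size=20), 1           # stand-in input and label

x_adv = pgd_attack(w, b, x, y)
print("clean loss:      ", bce_loss(w, b, x, y))
print("adversarial loss:", bce_loss(w, b, x_adv, y))
```

The projection step is what separates PGD from plain gradient ascent: it keeps the perturbation inside the allowed budget no matter how many steps are taken.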
In a black-box attack, the attacker has no internal model access and relies on query-based strategies. By observing input-output pairs, they can build a surrogate model to craft examples, or use transfer attacks where examples generated for one model fool another. The core mechanism exploits the model's high-dimensional linearity or non-robust features, not human-recognizable patterns.
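A query-based attack can be sketched without any gradient access. The example below assumes a hypothetical target that exposes only a class probability, as a remote API might, and uses a simple coordinate-wise search that keeps a change only when the returned probability drops. The function names and the logistic target are illustrative assumptions.

```python
import numpy as np

def make_black_box(seed=0, n_features=50):
    """Hypothetical target: a logistic model whose weights the attacker never sees."""
    w = np.random.default_rng(seed).normal(size=n_features)
    def query(x):
        # Returns only P(class = 1 | x); no gradients or parameters are exposed.
        return float(1.0 / (1.0 + np.exp(-(w @ x))))
    return query

def coordinate_search_attack(query, x, eps=0.05, max_queries=200, seed=1):
    """Lower the class-1 probability using output queries only."""
    rng = np.random.default_rng(seed)
    x_adv, best, queries = x.copy(), query(x), 1
    for i in rng.permutation(x.size):              # visit coordinates in random order
        for sign in (+1.0, -1.0):
            if queries >= max_queries:
                return x_adv, best, queries
            candidate = x_adv.copy()
            candidate[i] = x[i] + sign * eps       # each coordinate moves at most eps
            p = query(candidate)
            queries += 1
            if p < best:                           # keep only changes that help
                x_adv, best = candidate, p
                break
    return x_adv, best, queries

query = make_black_box()
x0 = np.random.default_rng(2).normal(size=50)
p0 = query(x0)
x_adv, p_adv, n_queries = coordinate_search_attack(query, x0)
print(f"P(class 1): {p0:.3f} -> {p_adv:.3f} using {n_queries} queries")
```

The query count is the main cost in this setting, which is why practical black-box attacks often combine limited querying with transfer from a surrogate model.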
Common Attack Methods for Generation
Adversarial examples are generated using specific attack algorithms. These methods vary in their assumptions about the attacker's knowledge, the type of perturbation allowed, and the attack's objective.
Adversarial Examples Across Domains
A comparison of adversarial example characteristics, attack methods, and defensive considerations across different data modalities.
| Feature / Consideration | Computer Vision (Images) | Natural Language Processing (Text) | Audio & Speech |
|---|---|---|---|
| Primary Perturbation Type | Additive pixel noise (L_p norm bounded) | Token substitutions, insertions, deletions | Additive inaudible noise or temporal perturbations |
| Human Perceptibility Threshold | Low (subtle pixel changes often invisible) | High (semantic coherence must be preserved) | Medium (perturbations must remain inaudible) |
| Common Attack Method Archetype | Gradient-based (e.g., FGSM, PGD) | Score-based or decision-based black-box | Gradient-based or psychoacoustic hiding |
| Key Defense Strategy | Adversarial training, input denoising | Semantic consistency checks, certified robustness | Adversarial training, signal purification |
| Physical-World Feasibility | | | |
| Transferability Between Models | | | |
| Primary Real-World Risk | Autonomous vehicle perception, facial recognition | Content moderation bypass, phishing detection evasion | Voice command hijacking, biometric spoofing |
| Benchmark Dataset Example | ImageNet-Adv, CIFAR-10-C | AdvGLUE, TextAttack | AudioMNIST, LibriSpeech-Adv |
Frequently Asked Questions
This FAQ addresses common questions about the mechanics and creation of adversarial examples and their implications for secure AI development.
An adversarial example is a carefully crafted input, often indistinguishable from a normal input to a human observer, that causes a machine learning model to make a high-confidence mistake. These inputs exploit the model's sensitivity to minute, often imperceptible, perturbations in high-dimensional space, revealing a fundamental gap between human and machine perception. They are a primary tool for adversarial testing and a critical concept in preemptive algorithmic cybersecurity.
Related Terms
Adversarial examples exist within a broader ecosystem of security threats and defensive methodologies. These related concepts define the attack vectors, defense strategies, and evaluation metrics critical to building robust AI systems.
Adversarial Attack
An adversarial attack is the deliberate act of crafting an input to cause a machine learning model to fail. It is the offensive methodology, of which an adversarial example is the resulting artifact. Attacks are categorized by the attacker's knowledge and goal:
- White-box: Attacker has full model access (architecture, parameters).
- Black-box: Attacker only observes inputs and outputs.
- Targeted: Forces a specific, incorrect output.
- Untargeted: Causes any incorrect output.
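The targeted and untargeted goals differ only in which label the loss is computed against and in the sign of the step. The sketch below assumes a hypothetical 3-class linear softmax classifier with random weights; `untargeted_step` and `targeted_step` are illustrative names for a single signed-gradient step of each kind.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy_grad_x(W, x, label):
    """Gradient of cross-entropy(softmax(W @ x), label) with respect to x."""
    p = softmax(W @ x)
    p[label] -= 1.0
    return W.T @ p

def untargeted_step(W, x, y_true, eps):
    # Move away from the true label: increase the loss on y_true.
    return x + eps * np.sign(cross_entropy_grad_x(W, x, y_true))

def targeted_step(W, x, y_target, eps):
    # Move toward a chosen label: decrease the loss on y_target.
    return x - eps * np.sign(cross_entropy_grad_x(W, x, y_target))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 10))             # hypothetical 3-class linear classifier
x = rng.normal(size=10)
y_true = int(np.argmax(W @ x))           # treat the current prediction as the true label
y_target = (y_true + 1) % 3              # arbitrary attacker-chosen class
eps = 0.1

x_untargeted = untargeted_step(W, x, y_true, eps)
x_targeted = targeted_step(W, x, y_target, eps)
print("P(true) before / after untargeted step:",
      softmax(W @ x)[y_true], softmax(W @ x_untargeted)[y_true])
print("P(target) before / after targeted step:",
      softmax(W @ x)[y_target], softmax(W @ x_targeted)[y_target])
```

A single step is not guaranteed to reach the target class; iterative methods repeat the targeted step until the prediction changes or the budget is exhausted.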
Adversarial Robustness
Adversarial robustness is the quantitative property of a model that measures its resistance to adversarial examples. It is the primary security objective in adversarial machine learning. Robustness is typically measured as robust accuracy—the model's accuracy on a test set containing adversarial examples—under a specific threat model defining the allowed perturbation magnitude (e.g., L-infinity norm ≤ 8/255 for images). A model with high standard accuracy but low robust accuracy is considered brittle and vulnerable.
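Robust accuracy can be computed exactly in one special case, which makes the definition easy to see. The sketch below assumes a linear classifier with labels in {-1, +1}: the worst-case L-infinity perturbation of size `eps` lowers every margin by exactly `eps * ||w||_1`, so no attack needs to be run. The data, weights, and the `accuracies` helper are illustrative assumptions; for deep networks, robust accuracy is usually estimated empirically with strong attacks.

```python
import numpy as np

def accuracies(w, b, X, y, eps):
    """Standard and robust accuracy of the linear classifier sign(X @ w + b)."""
    margins = y * (X @ w + b)
    worst_case_margins = margins - eps * np.abs(w).sum()   # exact for linear models
    standard = float(np.mean(margins > 0))
    robust = float(np.mean(worst_case_margins > 0))
    return standard, robust

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=500)
X = y[:, None] * 0.5 + rng.normal(size=(500, 10))   # two overlapping Gaussian classes
w, b = np.ones(10), 0.0                              # hypothetical fixed classifier

for eps in (0.0, 0.05, 0.1, 0.2):
    standard, robust = accuracies(w, b, X, y, eps)
    print(f"eps={eps:.2f}  standard={standard:.3f}  robust={robust:.3f}")
```

Robust accuracy can never exceed standard accuracy under this definition, and the gap between the two at a given `eps` is one way to quantify brittleness.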
Adversarial Training
Adversarial training is the primary defensive technique for improving model robustness. It involves augmenting the standard training process by generating and including adversarial examples in the training dataset. The model learns to classify these perturbed examples correctly, effectively regularizing its decision boundaries. A common implementation uses the Projected Gradient Descent (PGD) attack to generate strong training-time adversaries. While computationally expensive, it remains the most empirically effective defense against L_p-norm bounded attacks.
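The training loop can be sketched as an inner attack step followed by an ordinary parameter update. To stay short and self-contained, this example assumes logistic regression with labels in {-1, +1}, where the loss-maximizing L-infinity perturbation has a closed form and no iterative PGD is needed. The function name, data, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def adversarial_train(X, y, eps=0.1, lr=0.1, epochs=200):
    """Logistic regression trained on worst-case L-inf perturbed inputs."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        # Inner maximization: for a linear model the loss-maximizing perturbation
        # moves each input by eps against its label along sign(w).
        X_adv = X - eps * y[:, None] * np.sign(w)[None, :]
        # Outer minimization: gradient step on the logistic loss at the perturbed points.
        margins = y * (X_adv @ w + b)
        coeff = -y / (1.0 + np.exp(margins))       # d(loss) / d(logit)
        w -= lr * (X_adv.T @ coeff) / n
        b -= lr * coeff.mean()
    return w, b

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
X = y[:, None] * 2.0 + rng.normal(size=(200, 5))   # two well-separated classes

w, b = adversarial_train(X, y)
clean_acc = float(np.mean(y * (X @ w + b) > 0))
print("clean training accuracy:", clean_acc)
```

With a deep network, the inner step would be replaced by an attack such as PGD run against the current parameters, which is where most of the extra training cost comes from.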
Fast Gradient Sign Method (FGSM)
The Fast Gradient Sign Method is a simple, efficient white-box attack algorithm foundational to the field. Introduced by Goodfellow et al. in 2014, it generates an adversarial example by taking a single step in the direction of the gradient of the loss function with respect to the input:
x_adv = x + ε * sign(∇_x J(θ, x, y))
where ε is a small scalar controlling perturbation magnitude, J is the loss function, θ denotes the model parameters, x is the input, and y is the true label. Its speed makes it useful for adversarial training, though stronger iterative attacks like PGD often provide a more rigorous robustness evaluation.
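The formula translates almost line for line into code. This is a minimal sketch, assuming a logistic regression model so that the gradient of J with respect to x can be written analytically; `theta` is a hypothetical `(w, b)` tuple with random values, and labels are in {0, 1}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y):
    """J(theta, x, y): binary cross-entropy of a logistic model."""
    w, b = theta
    p = np.clip(sigmoid(w @ x + b), 1e-12, 1 - 1e-12)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def grad_x(theta, x, y):
    """Analytic gradient of J with respect to the input x."""
    w, b = theta
    return (sigmoid(w @ x + b) - y) * w

def fgsm(theta, x, y, eps):
    """x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    return x + eps * np.sign(grad_x(theta, x, y))

rng = np.random.default_rng(0)
theta = (rng.normal(size=20), 0.0)      # stand-in parameters (w, b)
x, y = rng.normal(size=20), 1           # stand-in input and true label
eps = 0.05

x_adv = fgsm(theta, x, y, eps)
print("loss before:", loss(theta, x, y))
print("loss after: ", loss(theta, x_adv, y))
```

For image inputs, the result is usually also clipped back to the valid pixel range after the step.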
Transfer Attack
A transfer attack exploits the transferability property of adversarial examples, where an example crafted to fool one model (a surrogate) also fools a different, target model. This is the core mechanism enabling practical black-box attacks, as the attacker can train a local surrogate model and craft examples against it. Transferability occurs because different models often learn similar decision boundaries for the same task. Attack success rates increase when the surrogate and target models share architectural similarities or are trained on similar data distributions.
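A toy version of a transfer attack is sketched below. It assumes two logistic regression models trained on disjoint samples of the same synthetic task; perturbations are crafted against the surrogate only and then evaluated on the target. All names, data, and hyperparameters are illustrative assumptions, and linear models transfer far more readily than real networks with different architectures would.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=100):
    """Plain gradient-descent logistic regression, labels in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        coeff = -y / (1.0 + np.exp(y * (X @ w + b)))   # d(loss) / d(logit)
        w -= lr * (X.T @ coeff) / len(y)
        b -= lr * coeff.mean()
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean(y * (X @ w + b) > 0))

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=600)
X = y[:, None] * 1.0 + rng.normal(size=(600, 20))

# Surrogate and target see disjoint training data; the attacker holds only the surrogate.
w_s, b_s = train_logreg(X[:250], y[:250])
w_t, b_t = train_logreg(X[250:500], y[250:500])

# Craft L-inf perturbations using the surrogate's weights alone.
X_test, y_test = X[500:], y[500:]
eps = 0.5
X_adv = X_test - eps * y_test[:, None] * np.sign(w_s)[None, :]

clean_acc = accuracy(w_t, b_t, X_test, y_test)
adv_acc = accuracy(w_t, b_t, X_adv, y_test)
print(f"target accuracy: clean={clean_acc:.3f}, transferred adversarial={adv_acc:.3f}")
```

The attack transfers to the extent that the two weight vectors point in similar directions, which mirrors the explanation above that models trained on the same task tend to learn similar decision boundaries.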
Physical Adversarial Attack
A physical adversarial attack moves the threat from digital pixel space to the physical world. Here, perturbations are applied to real-world objects (e.g., a stop sign, a person's glasses) to fool computer vision systems. These attacks must account for environmental variables like lighting, angle, and distance. A famous example is the patch attack, where a visible sticker causes misclassification. Defending against physical attacks is significantly harder, requiring robustness to a broader, less constrained set of transformations beyond simple L_p-norm bounds.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.