A targeted attack is an adversarial attack where the adversary crafts an input, known as an adversarial example, to cause a machine learning model to output a specific, incorrect class chosen by the attacker. This contrasts with an untargeted attack, which only seeks any misclassification. The attack is considered successful only if the model's prediction matches the attacker's designated target label, making it a more precise and often more challenging objective than simply causing an error.
Targeted Attack

What is a Targeted Attack?
A targeted adversarial attack is a specific type of security exploit against machine learning models where the attacker aims to produce a pre-selected, incorrect output.
Executing a targeted attack typically requires finding a minimal perturbation to a legitimate input that moves it across the model's decision boundary into the region of the target class. Common methods include optimization-based approaches like the Carlini & Wagner (C&W) attack. Defenses against such attacks include adversarial training with targeted examples and evaluating a model's robust accuracy. Targeted attacks are a core focus of red-teaming exercises to probe model security.
Key Characteristics of a Targeted Attack
A targeted adversarial attack is defined by the adversary's specific goal of causing a model to output a particular, predetermined incorrect class. This contrasts with untargeted attacks, which aim for any misclassification. The following characteristics distinguish its methodology and objectives.
Specificity of Objective
The defining feature of a targeted attack is the adversary's precise goal. Instead of causing any misclassification, the attacker aims to steer the model toward a specific, incorrect output class. Examples include causing an autonomous vehicle's vision system to classify a stop sign as a 'yield' sign, or causing a financial fraud detector to label a fraudulent transaction as 'legitimate'. This requires more sophisticated perturbation crafting than untargeted attacks, as the adversarial example must not only cross a decision boundary but land within the target class's decision region, which is often far from the original input.
Higher Attack Difficulty
Targeted attacks are generally more computationally expensive and often require larger perturbations than their untargeted counterparts. This is because the optimization problem is more constrained: the target class's score must exceed the score of the true class and of every other class, whereas an untargeted attack succeeds as soon as any wrong class overtakes the true one. Algorithms like the Carlini & Wagner (C&W) attack are explicitly designed for this purpose, formulating it as a minimization problem with a custom loss function that penalizes any output other than the desired target.
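To make the "custom loss function" idea concrete, here is a minimal sketch of a C&W-style targeted margin term. The function name `cw_targeted_margin` and the toy logits are illustrative assumptions, not taken from any particular library; the margin form follows the commonly cited C&W formulation.

```python
import numpy as np

def cw_targeted_margin(logits, target, kappa=0.0):
    """C&W-style targeted margin: max(max_{i != target} Z_i - Z_target, -kappa).

    Positive while some other class still beats the target; clamped at
    -kappa once the target leads by at least the confidence margin kappa.
    """
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, target)      # logits of every non-target class
    return max(others.max() - logits[target], -kappa)

# Hypothetical logits: class 0 currently wins, the attacker wants class 1.
loss_before = cw_targeted_margin([2.0, 0.5, 1.0], target=1)             # 1.5
# Class 1 now leads by 2.0, so the term is clamped at -kappa.
loss_after = cw_targeted_margin([0.0, 3.0, 1.0], target=1, kappa=1.0)   # -1.0
```

An optimizer that drives this term down is pushing the target logit above all others, which is exactly the extra constraint that makes targeted attacks harder than untargeted ones.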
Formal Optimization Problem
Targeted attacks are often framed as a constrained optimization. The adversary seeks a minimal perturbation δ added to a clean input x such that:
- f(x + δ) = y_target (the model predicts the target class)
- ||δ||_p ≤ ε (the perturbation is small under a p-norm, e.g., L₂ or L_∞)
This formulation is central to white-box attacks like C&W and Projected Gradient Descent (PGD) when configured for a target. The objective function directly encodes the distance to the target class's decision region, making the attack's success measurable and reproducible for benchmarking adversarial robustness.
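The constrained problem above can be sketched as a targeted PGD loop under an L_∞ budget. This is a toy illustration only: the model is an assumed linear softmax classifier with made-up weights `W` and `b`, and `targeted_pgd`, `eps`, `step`, and `iters` are hypothetical names chosen for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_pgd(x, target, W, b, eps=0.5, step=0.05, iters=100):
    """Search for delta with ||delta||_inf <= eps so that argmax f(x + delta) == target."""
    delta = np.zeros_like(x)
    onehot = np.eye(W.shape[0])[target]
    for _ in range(iters):
        p = softmax(W @ (x + delta) + b)
        grad = W.T @ (p - onehot)              # gradient of cross-entropy(target) w.r.t. the input
        delta = delta - step * np.sign(grad)   # descend: lower the loss of the target class
        delta = np.clip(delta, -eps, eps)      # project back onto the L_inf ball of radius eps
    return delta

# Assumed toy model: 3 classes, 2 input features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([1.0, 0.2])                       # clean input, predicted as class 0

delta = targeted_pgd(x, target=1, W=W, b=b, eps=0.5)
clean_pred = int(np.argmax(W @ x + b))
adv_pred = int(np.argmax(W @ (x + delta) + b))
```

On this toy model the clean prediction is class 0 and the perturbed prediction is class 1, with the perturbation staying inside the ε = 0.5 budget. A real attack would obtain the gradient from an autodiff framework and would also clip x + δ to the valid input range.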
Use in Security Evaluation
In red-teaming and security audits, targeted attacks are a critical stress test. They simulate a worst-case scenario where an adversary has a concrete, harmful objective. Successfully executing a targeted attack against a model reveals deeper vulnerabilities than an untargeted one. Evaluating a model's robust accuracy against targeted attacks provides a stringent measure of its reliability in high-stakes applications like content moderation (forcing a harmful post to be classified as 'safe') or medical diagnosis (forcing a malignant scan to be classified as 'benign').
Connection to Backdoor Attacks
Targeted attacks share conceptual ground with backdoor attacks, but operate at different phases. A backdoor attack is a poisoning attack executed during training, where a model learns to associate a trigger pattern with a specific target label. At inference, any input containing the trigger causes the targeted misclassification. In contrast, a standard targeted attack is an evasion attack at inference time on a clean model. Both aim for a specific incorrect output, making the analysis of a model's susceptibility to targeted perturbations a key part of defending against potential backdoors.
Lower Transferability
Adversarial examples crafted for a targeted attack on one model are generally less transferable to other models than those from untargeted attacks. This is because the precise perturbations needed to reach a specific class in one model's high-dimensional feature space are often highly specific to that model's unique weight geometry. While transfer attacks are possible, targeted transferability is a weaker phenomenon, making black-box targeted attacks more challenging and often requiring extensive query-based attack strategies to approximate the target model's decision boundaries.
How Does a Targeted Attack Work?
A targeted adversarial attack is a security exploit where an adversary crafts a malicious input to force a machine learning model to produce a specific, predetermined incorrect output.
A targeted attack is an evasion attack where the adversary's objective is not merely to cause a mistake, but to induce a specific misclassification. For example, an attacker might modify an image of a stop sign so an autonomous vehicle's vision system classifies it as a speed limit sign. This is distinct from an untargeted attack, which only seeks any incorrect output. The attack crafts an adversarial example by applying a subtle, often human-imperceptible, perturbation to a legitimate input, calculated to cross the model's decision boundary toward the chosen target class.
Execution typically requires calculating the gradient of the model's loss with respect to the input, guiding the perturbation. In a white-box attack, the attacker has full access to the model's architecture and parameters to compute this precisely. Black-box attacks use iterative query-based methods or transfer attacks from a surrogate model. Defenses include adversarial training with targeted examples and rigorous red-teaming to measure robust accuracy against such specified failures.
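As a rough illustration of how the input gradient guides the perturbation, the sketch below contrasts a single untargeted step (ascend the loss of the true label) with a single targeted step (descend the loss of the target label). The linear softmax model, its weights, and the helper names `input_gradient` and `fgsm_step` are assumptions made for this example.

```python
import numpy as np

def input_gradient(x, label, W, b):
    """Gradient of softmax cross-entropy for `label` with respect to the input x."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    p[label] -= 1.0                            # softmax probabilities minus one-hot(label)
    return W.T @ p

def fgsm_step(x, W, b, eps, true_label, target=None):
    if target is None:
        # Untargeted: move up the loss of the true label.
        return x + eps * np.sign(input_gradient(x, true_label, W, b))
    # Targeted: move down the loss of the attacker's chosen label.
    return x - eps * np.sign(input_gradient(x, target, W, b))

def predict(x, W, b):
    return int(np.argmax(W @ x + b))

# Assumed toy model: 3 classes, 2 input features; x is classified as class 0.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([1.0, 0.2])

untargeted_pred = predict(fgsm_step(x, W, b, eps=0.5, true_label=0), W, b)
targeted_small = predict(fgsm_step(x, W, b, eps=0.5, true_label=0, target=2), W, b)
targeted_large = predict(fgsm_step(x, W, b, eps=1.0, true_label=0, target=2), W, b)
```

In this toy setup the untargeted step with ε = 0.5 already flips the prediction (to class 1), while the targeted step toward class 2 fails at ε = 0.5 and only succeeds at ε = 1.0, which mirrors the point that a specific target usually costs a larger perturbation.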
Common Targeted Attack Methods
Targeted attacks are distinguished by the adversary's goal: to force a specific, predetermined misclassification. These methods craft inputs to exploit model vulnerabilities with surgical precision.
Targeted Attack vs. Untargeted Attack
A comparison of two fundamental adversarial attack strategies, distinguished by the specificity of the attacker's goal.
| Feature | Targeted Attack | Untargeted Attack |
|---|---|---|
| Primary Objective | Cause the model to output a specific, attacker-chosen incorrect class. | Cause the model to output any incorrect class. |
| Attack Formulation | Minimizes loss for the target class while maximizing loss for the true class. | Maximizes loss for the true class. |
| Adversarial Constraint | More constrained; must find a perturbation that moves the input to a precise region of the output space. | Less constrained; any perturbation that crosses a decision boundary is sufficient. |
| Typical Difficulty | Generally more difficult and computationally intensive to execute successfully. | Generally easier and less computationally intensive to execute. |
| Perturbation Magnitude | Often requires larger perturbations to reach a specific, distant target class. | Can often succeed with smaller perturbations to reach the nearest decision boundary. |
| Success Metric | Targeted success rate: the percentage of inputs misclassified as the specific target class. | Untargeted success rate (attack success rate): the percentage of inputs that are misclassified. |
| Common Use Cases | Testing for specific, high-consequence failures (e.g., misclassifying a 'stop' sign as a 'speed limit' sign). Red-teaming for precise vulnerabilities. | General robustness evaluation. Stress-testing a model's overall decision boundaries. |
| Transferability | Lower transferability between models, as the target class decision boundaries are model-specific. | Higher transferability, as decision boundaries for the original class may be similarly vulnerable across models. |
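The two success metrics in the table can be computed from the same set of adversarial predictions. The following is a small sketch with made-up labels; `attack_success_rates` is a hypothetical helper, and it assumes each target label differs from the corresponding true label.

```python
import numpy as np

def attack_success_rates(y_true, y_adv_pred, y_target):
    """Return (untargeted_rate, targeted_rate) for a batch of adversarial predictions."""
    y_true, y_adv_pred, y_target = map(np.asarray, (y_true, y_adv_pred, y_target))
    untargeted = float(np.mean(y_adv_pred != y_true))    # any misclassification counts
    targeted = float(np.mean(y_adv_pred == y_target))    # only the chosen class counts
    return untargeted, targeted

# Illustrative labels only.
y_true     = [0, 0, 1, 2]
y_target   = [1, 1, 2, 0]
y_adv_pred = [1, 2, 1, 0]

untargeted_rate, targeted_rate = attack_success_rates(y_true, y_adv_pred, y_target)
# untargeted_rate == 0.75, targeted_rate == 0.5
```

Because every targeted success is also a misclassification (under the stated assumption), the targeted rate can never exceed the untargeted rate on the same examples.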
Frequently Asked Questions
A targeted adversarial attack aims to cause a machine learning model to output a specific, incorrect class chosen by the adversary. This glossary entry addresses common technical questions about how these attacks work, their mechanisms, and defensive strategies.
A targeted adversarial attack is a security exploit where an adversary crafts a malicious input, known as an adversarial example, to deliberately cause a machine learning model to misclassify it as a specific, incorrect label chosen by the attacker. Unlike an untargeted attack, which only seeks any incorrect output, a targeted attack aims for a precise, wrong prediction, such as forcing a facial recognition system to identify a person as someone else. This requires more sophisticated perturbation methods, often involving optimization to minimize input distortion while maximizing the probability of the target class.
Related Terms
Targeted attacks are part of a broader adversarial testing landscape. Understanding these related concepts is essential for building robust AI systems.
Untargeted Attack
An adversarial attack where the adversary's sole objective is to cause the model to output any incorrect prediction, without specifying a particular wrong class. This is a broader, less constrained goal than a targeted attack.
- Primary Goal: Cause misclassification, not control the specific error.
- Evaluation Use: Often used as a baseline for measuring a model's general vulnerability to perturbation.
- Example: Modifying an image of a cat just enough so the model calls it a 'dog', 'car', or any other label besides 'cat'.
Evasion Attack
An adversarial attack executed at inference time, where a malicious input is crafted to bypass a deployed model's detection or classification. Both targeted and untargeted attacks are subtypes of evasion attacks.
- Phase of Operation: Post-deployment, during model inference.
- Contrast with Poisoning: Differs from data poisoning, which occurs during the training phase.
- Real-World Context: The most direct threat to production AI systems, such as fooling a content filter or biometric scanner.
Carlini & Wagner Attack (C&W)
A powerful, optimization-based white-box attack method designed to generate adversarial examples with minimal perturbation. It is particularly effective for executing precise targeted attacks.
- Methodology: Formulates attack generation as an optimization problem, minimizing perturbation subject to the constraint of causing the target misclassification.
- Key Use Case: A standard benchmark for evaluating the strength of adversarial defenses; it was originally used to show that defensive distillation does not provide robustness.
- Precision: Excels at finding the smallest possible changes needed to achieve a targeted misclassification.
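For reference, the L₂ variant of the C&W targeted objective is commonly written as follows. Here Z denotes the model's logits, c > 0 is a trade-off constant (typically tuned by search), and κ ≥ 0 is a confidence margin; these symbols are introduced for this sketch and are not defined elsewhere in this entry.

```latex
\min_{\delta} \; \|\delta\|_2^2 + c \cdot g(x + \delta)
\quad \text{subject to} \quad x + \delta \in [0, 1]^n,
\qquad
g(x') = \max\Big( \max_{i \neq y_{\text{target}}} Z(x')_i - Z(x')_{y_{\text{target}}},\; -\kappa \Big)
```

The first term keeps the perturbation small, while g reaches its floor of -κ only once the target logit exceeds every other logit by at least κ.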
Projected Gradient Descent (PGD)
A strong, iterative white-box attack and the cornerstone for modern adversarial training. PGD applies the Fast Gradient Sign Method (FGSM) multiple times with a small step size, projecting perturbations back to a valid norm ball after each step.
- Strength: Often described as a universal first-order adversary; a model robust to PGD is often robust to many other first-order attacks.
- Role in Training: Adversarial training using PGD-generated examples is a leading defense technique.
- Versatility: Can be easily configured for both targeted and untargeted attack objectives.
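The projection step that gives PGD its name can be sketched for the two most common norm balls. The helper names `project_linf` and `project_l2` are illustrative, not drawn from a specific library.

```python
import numpy as np

def project_linf(delta, eps):
    """Project onto the L_inf ball: clip each coordinate to [-eps, eps]."""
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    """Project onto the L2 ball: rescale only if the norm exceeds eps."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

delta = np.array([3.0, -4.0])                  # L2 norm is 5.0
linf_projected = project_linf(delta, eps=1.0)  # [ 1.0, -1.0]
l2_projected = project_l2(delta, eps=1.0)      # [ 0.6, -0.8]
```

The same projection is used whether the gradient step is targeted or untargeted; only the direction of the step changes.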
Transfer Attack
An attack where an adversarial example crafted against one model (the surrogate model) is also effective against a different, potentially black-box, target model. This property enables practical black-box targeted attacks.
- Mechanism: Relies on the transferability of adversarial perturbations between models trained on similar data.
- Black-Box Application: An adversary can train their own surrogate model, craft a targeted attack against it, and sometimes have it transfer to the unknown target, although targeted examples transfer less reliably than untargeted ones.
- Security Implication: Means that even without internal model access, targeted attacks are feasible.
Robust Accuracy
A model's classification accuracy measured on adversarially perturbed versions of the test inputs, generated under a stated threat model (attack method, norm, and perturbation budget). It provides a more comprehensive measure of real-world reliability than standard accuracy.
- Key Metric: The primary benchmark for evaluating a model's defense against evasion attacks, including targeted ones.
- Calculation: Typically reported as accuracy under attack from a specific threat model (e.g., "PGD robust accuracy").
- Trade-off: Often exists between standard accuracy and robust accuracy; improving one can reduce the other.
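A minimal sketch of how robust accuracy might be computed alongside standard accuracy is shown below. The one-dimensional threshold classifier, the `shift_toward_boundary` attack, and the data are toy assumptions chosen so the numbers are easy to check by hand.

```python
import numpy as np

def accuracy(predict, X, y):
    return float(np.mean(predict(X) == y))

def robust_accuracy(predict, attack, X, y):
    """Accuracy on inputs perturbed by `attack`, one example at a time."""
    X_adv = np.stack([attack(x, label) for x, label in zip(X, y)])
    return accuracy(predict, X_adv, y)

# Assumed toy classifier: class 1 if the single feature is positive.
def predict(X):
    return (X[:, 0] > 0).astype(int)

# Assumed toy attack: shift each input by eps toward the decision boundary.
def shift_toward_boundary(x, label, eps=0.5):
    return x - eps * (2 * label - 1)

X = np.array([[-2.0], [-0.3], [0.4], [1.5]])
y = np.array([0, 0, 1, 1])

standard_acc = accuracy(predict, X, y)                             # 1.0
robust_acc = robust_accuracy(predict, shift_toward_boundary, X, y) # 0.5
```

The two examples closest to the boundary flip under the perturbation, so accuracy drops from 1.0 to 0.5. In practice the attack would be something like PGD with a fixed budget, and the reported figure is only meaningful together with that threat model.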

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.