Glossary

Targeted Attack

A targeted adversarial attack is a security exploit where an adversary crafts an input to force a machine learning model to output a specific, incorrect prediction chosen by the attacker.

ADVERSARIAL TESTING

What is a Targeted Attack?

A targeted adversarial attack is a specific type of security exploit against machine learning models where the attacker aims to produce a pre-selected, incorrect output.

A targeted attack is an adversarial attack where the adversary crafts an input, known as an adversarial example, to cause a machine learning model to output a specific, incorrect class chosen by the attacker. This contrasts with an untargeted attack, which only seeks any misclassification. The attack is considered successful only if the model's prediction matches the attacker's designated target label, making it a more precise and often more challenging objective than simply causing an error.

Executing a targeted attack typically requires finding a minimal perturbation to a legitimate input that moves it across the model's decision boundary into the region of the target class. Common methods include optimization-based approaches like the Carlini & Wagner (C&W) attack. The main defense is adversarial training with targeted examples; its effect is measured by the model's robust accuracy under attack. Targeted attacks are a core focus of red-teaming exercises that probe model security.

Key Characteristics of a Targeted Attack

A targeted adversarial attack is defined by the adversary's specific goal of causing a model to output a particular, predetermined incorrect class. This contrasts with untargeted attacks, which aim for any misclassification. The following characteristics distinguish its methodology and objectives.

Specificity of Objective

The defining feature of a targeted attack is the adversary's precise goal. Instead of causing any misclassification, the attacker aims to steer the model toward a specific, incorrect output class. For example, the attacker might cause an autonomous vehicle's vision system to classify a stop sign as a 'yield' sign, or cause a financial fraud detector to label a fraudulent transaction as 'legitimate'. This requires more sophisticated perturbation crafting than untargeted attacks, as the adversarial example must not only cross a decision boundary but also land within a specific, often distant, region of the output space.

Higher Attack Difficulty

Targeted attacks are generally more computationally complex and require larger perturbations than their untargeted counterparts. This is because the optimization problem is more constrained: the adversarial example must maximize the probability of the target class while simultaneously minimizing the probability of the true class and all other classes. Algorithms like the Carlini & Wagner (C&W) attack are explicitly designed for this purpose, formulating it as a minimization problem with a custom loss function that penalizes any output other than the desired target.

Formal Optimization Problem

Targeted attacks are often framed as a constrained optimization. The adversary seeks a minimal perturbation δ added to a clean input x such that:

f(x + δ) = y_target (the model predicts the target class)
||δ||_p ≤ ε (the perturbation is small under a p-norm, e.g., L₂ or L_∞)

This formulation is central to white-box attacks when configured for a target: C&W minimizes ||δ||_p directly, while Projected Gradient Descent (PGD) searches within a fixed budget ε. The objective encodes how far the input is from the target class's decision region, which makes attack success measurable and reproducible when benchmarking adversarial robustness.
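
As a minimal sketch, the two conditions above can be checked directly for any candidate perturbation. The helper names `lp_norm` and `is_targeted_adversarial` and the toy `predict` function are illustrative assumptions, not part of any library.

```python
import math

def lp_norm(delta, p):
    """Return the L_p norm of a perturbation given as a list of floats."""
    if p == math.inf:
        return max(abs(d) for d in delta)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)

def is_targeted_adversarial(predict, x, delta, y_target, eps, p=math.inf):
    """Check both conditions: f(x + delta) == y_target and ||delta||_p <= eps."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    return predict(x_adv) == y_target and lp_norm(delta, p) <= eps

# Hypothetical toy classifier: class 1 if the feature sum exceeds 1, else class 0.
def predict(x):
    return 1 if sum(x) > 1.0 else 0

x = [0.4, 0.5]          # clean input, predicted as class 0
delta = [0.1, 0.1]      # candidate perturbation
print(is_targeted_adversarial(predict, x, delta, y_target=1, eps=0.1))   # True
print(is_targeted_adversarial(predict, x, delta, y_target=1, eps=0.05))  # False
```

The second call fails only because the perturbation exceeds the budget, which shows that both conditions must hold at once.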

Use in Security Evaluation

In red-teaming and security audits, targeted attacks are a critical stress test. They simulate a worst-case scenario where an adversary has a concrete, harmful objective. Successfully executing a targeted attack against a model reveals deeper vulnerabilities than an untargeted one. Evaluating a model's robust accuracy against targeted attacks provides a stringent measure of its reliability in high-stakes applications like content moderation (forcing a harmful post to be classified as 'safe') or medical diagnosis (forcing a malignant scan to be classified as 'benign').

Connection to Backdoor Attacks

Targeted attacks share conceptual ground with backdoor attacks, but operate at different phases. A backdoor attack is a poisoning attack executed during training, where a model learns to associate a trigger pattern with a specific target label. At inference, any input containing the trigger causes the targeted misclassification. In contrast, a standard targeted attack is an evasion attack at inference time on a clean model. Both aim for a specific incorrect output, making the analysis of a model's susceptibility to targeted perturbations a key part of defending against potential backdoors.

Lower Transferability

Adversarial examples crafted for a targeted attack on one model are generally less transferable to other models than those from untargeted attacks. This is because the precise perturbations needed to reach a specific class in one model's high-dimensional feature space are often highly specific to that model's unique weight geometry. While transfer attacks are possible, targeted transferability is a weaker phenomenon, making black-box targeted attacks more challenging and often requiring extensive query-based attack strategies to approximate the target model's decision boundaries.

How Does a Targeted Attack Work?

A targeted adversarial attack is a security exploit where an adversary crafts a malicious input to force a machine learning model to produce a specific, predetermined incorrect output.

A targeted attack is an evasion attack where the adversary's objective is not merely to cause a mistake, but to induce a specific misclassification. For example, an attacker might modify an image of a stop sign so an autonomous vehicle's vision system classifies it as a speed limit sign. This is distinct from an untargeted attack, which only seeks any incorrect output. The attack crafts an adversarial example by applying a subtle, often human-imperceptible, perturbation to a legitimate input, calculated to cross the model's decision boundary toward the chosen target class.

Execution typically relies on the gradient of the model's loss with respect to the input, which guides the perturbation. In a white-box attack, the attacker has full access to the model's architecture and parameters and can compute this gradient exactly. Black-box attacks instead use iterative query-based methods or transfer attacks from a surrogate model. The main defense is adversarial training with targeted examples, with rigorous red-teaming used to measure robust accuracy against these specific failures.
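
For the black-box case, one simple way to approximate the gradient from queries alone is a finite-difference estimate. The sketch below assumes the attacker can query a score for the target class; `estimate_gradient` and `target_score` are hypothetical names, and the quadratic score is only a stand-in for a real model's output.

```python
def estimate_gradient(score_fn, x, h=1e-4):
    """Central finite-difference estimate of d(score)/dx using 2 * len(x) queries."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((score_fn(x_plus) - score_fn(x_minus)) / (2.0 * h))
    return grad

# Hypothetical black-box score standing in for the target-class probability.
def target_score(x):
    return x[0] ** 2 + 3.0 * x[1]

g = estimate_gradient(target_score, [1.0, 2.0])
print(g)  # approximately [2.0, 3.0]
```

A targeted attack would then step in the direction that raises this score. The query cost grows linearly with the input dimension, which is why practical query-based attacks use more economical estimators.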

Common Targeted Attack Methods

Targeted attacks are distinguished by the adversary's goal: to force a specific, predetermined misclassification. These methods craft inputs to exploit model vulnerabilities with surgical precision.

Carlini & Wagner (C&W) Attack

An optimization-based white-box attack designed to find the minimal perturbation required to cause a targeted misclassification. It formulates the attack as a constrained optimization problem, often using a specialized loss function to balance perturbation size with attack success. It is considered a strong benchmark for evaluating defenses like defensive distillation.

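Key Mechanism: Solves minimize ||δ||_p + c⋅f(x+δ), where c > 0 weights the two terms and f is an attack loss (not the model itself) that is ≤ 0 exactly when the attack succeeds.
Primary Use: Breaking gradient-masking defenses and establishing upper bounds on robust accuracy, since every successful attack shows the model is no more robust than measured.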
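
A minimal sketch of this objective, assuming the logit-margin form of f commonly associated with C&W and a squared L₂ penalty. The function names and the confidence parameter `kappa` are illustrative, and the optimizer, the change of variables, and the search over c are omitted.

```python
def cw_margin(logits, target, kappa=0.0):
    """Margin loss max(max_{i != t} Z_i - Z_t, -kappa).

    The value is <= 0 exactly when the target logit is at least as large
    as every other logit, i.e. when the targeted attack succeeds.
    """
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)

def cw_objective(delta, logits_adv, target, c=1.0, kappa=0.0):
    """Squared L2 norm of delta plus c times the margin loss."""
    l2_sq = sum(d * d for d in delta)
    return l2_sq + c * cw_margin(logits_adv, target, kappa)

print(cw_margin([2.0, 1.0, 0.5], target=1))             # 1.0, attack not yet successful
print(cw_margin([1.0, 3.0, 0.5], target=1, kappa=0.5))  # -0.5, successful with margin
```

Raising `kappa` asks for a more confident misclassification, at the cost of a larger perturbation.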

Projected Gradient Descent (PGD)

A strong, iterative white-box attack and the foundational method for adversarial training. It performs multiple steps of the Fast Gradient Sign Method (FGSM) with a small step size, projecting the perturbed example back into the ε-ball around the original input after each iteration. This makes it a powerful universal first-order adversary.

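Key Mechanism: x_{t+1} = Proj_{B_ε(x)}(x_t − α⋅sign(∇_x J(θ, x_t, y_target))), where B_ε(x) is the ε-ball around the clean input x. The minus sign descends the loss for y_target; the untargeted variant instead ascends the loss for the true label.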
Primary Use: The standard attack for training robust models (PGD-based adversarial training).
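
The targeted update can be sketched on a small linear softmax classifier, where the input gradient has a closed form. The model weights, step size, and budget below are hypothetical values chosen for illustration; a real attack would use an autodiff framework rather than the hand-written gradient.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def grad_ce_wrt_input(W, b, x, label):
    """Gradient of cross-entropy(softmax(Wx + b), label) with respect to x."""
    p = softmax(logits(W, b, x))
    p[label] -= 1.0  # dL/dz = p - onehot(label)
    return [sum(p[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]

def targeted_pgd(W, b, x, y_target, eps=0.3, alpha=0.05, steps=20):
    """Descend the target-class loss, projecting into the L_inf eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_ce_wrt_input(W, b, x_adv, y_target)
        # Minus sign: a targeted attack lowers the loss for y_target.
        x_adv = [xa - alpha * math.copysign(1.0, gi) if gi != 0 else xa
                 for xa, gi in zip(x_adv, g)]
        # Projection: clip each coordinate to [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def predict(W, b, x):
    z = logits(W, b, x)
    return z.index(max(z))

# Hypothetical 3-class linear model on 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [0.6, 0.4]  # clean prediction: class 0
x_adv = targeted_pgd(W, b, x, y_target=1)
print(predict(W, b, x), predict(W, b, x_adv))  # prints: 0 1
```

Flipping the sign of the step and passing the true label instead of `y_target` would turn this into the untargeted variant.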

Jacobian-based Saliency Map Attack (JSMA)

A feature-space, white-box attack that constructs adversarial examples by iteratively modifying the most influential input features (e.g., pixels) to push the model towards a target class. It uses the model's Jacobian matrix to compute a saliency map, identifying which features to perturb.

Key Mechanism: Computes a saliency score S(x, t)[i] for each feature i based on forward derivatives toward the target class t.
Primary Use: Generating sparse perturbations (changing few features) rather than small-norm perturbations.
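
A minimal sketch of the saliency score for features that are increased, assuming the Jacobian of the class outputs with respect to the input is already available. The matrix `J` below is made up for illustration, and `jsma_saliency` is a hypothetical helper name.

```python
def jsma_saliency(jacobian, target):
    """Saliency scores for increasing each feature toward the target class.

    jacobian[k][i] is dF_k/dx_i, the derivative of class k's output with
    respect to input feature i. A feature scores zero unless raising it
    increases the target output and decreases the sum of the other outputs.
    """
    n_features = len(jacobian[0])
    scores = []
    for i in range(n_features):
        d_target = jacobian[target][i]
        d_others = sum(row[i] for k, row in enumerate(jacobian) if k != target)
        if d_target < 0 or d_others > 0:
            scores.append(0.0)
        else:
            scores.append(d_target * abs(d_others))
    return scores

# Hypothetical Jacobian for a 3-class model with 4 input features.
J = [
    [0.2, -0.5, 0.1, -0.3],   # class 0
    [0.4, 0.6, -0.2, 0.5],    # class 1 (target)
    [-0.1, -0.4, 0.3, 0.1],   # class 2
]
scores = jsma_saliency(J, target=1)
best = scores.index(max(scores))
print(scores, best)
```

Here feature 1 receives the highest score, so it would be the first feature to perturb. Repeating the computation after each modification yields the sparse perturbation described above.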

HopSkipJumpAttack

A query-efficient, decision-based black-box attack that requires only the final model decision (the predicted class label), not confidence scores. It uses a binary search and gradient estimation approach to find a targeted adversarial example with minimal perturbations, starting from a point already in the target class.

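Key Mechanism: Each iteration uses a binary search to move the current adversarial point onto the decision boundary, estimates the gradient direction at that point from hard-label queries, and then takes a step along that direction, halving the step size until the point keeps the target label.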
Primary Use: Attacking production models where only hard labels are available, simulating a realistic black-box threat.
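
The boundary-search component can be sketched with hard-label queries only. This covers just the binary search between the clean input and a point already in the target class; the gradient estimation and step-size rule are omitted, and the function names and toy oracle are assumptions made for illustration.

```python
def boundary_binary_search(predict_label, x_source, x_target_class, y_target,
                           tol=1e-3):
    """Move a target-class point toward x_source while it stays in y_target.

    Uses only hard-label queries. Returns the blend closest to x_source
    (within tol along the segment) that is still labelled y_target, plus
    the number of queries made.
    """
    lo, hi = 0.0, 1.0  # blend weight on x_target_class; hi is known adversarial
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        x_mid = [(1 - mid) * s + mid * t for s, t in zip(x_source, x_target_class)]
        queries += 1
        if predict_label(x_mid) == y_target:
            hi = mid  # still adversarial: move closer to the source
        else:
            lo = mid  # lost the target label: back off
    x_adv = [(1 - hi) * s + hi * t for s, t in zip(x_source, x_target_class)]
    return x_adv, queries

# Hypothetical hard-label oracle: class 1 when the first feature exceeds 0.5.
def predict_label(x):
    return 1 if x[0] > 0.5 else 0

x_source = [0.0, 0.0]  # clean input, class 0
x_start = [1.0, 1.0]   # any input already classified as the target
x_adv, n = boundary_binary_search(predict_label, x_source, x_start, y_target=1)
print(predict_label(x_adv), n)  # prints: 1 10
```

Each halving costs one query, so the tolerance directly controls the query budget spent per boundary search.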

Adversarial Patch Attack

A physical, targeted attack where a visible, often semantically meaningful patch is applied to an object to cause a specific misclassification. Unlike perturbations computed for a single input, the patch is input-agnostic and designed to be effective across many spatial locations and backgrounds.

Key Mechanism: Optimizes a single patch p to maximize the probability of a target class y_target when overlaid on any input x: argmax_p E_{x,location}[log P(y_target | A(x, p, location))].
Primary Use: Evaluating the physical-world robustness of computer vision systems (e.g., causing an autonomous vehicle to misread a stop sign as a speed limit sign).
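
The expectation in this objective can be estimated by sampling inputs and patch locations. The sketch below only evaluates the objective for a given patch and does not optimize it; `apply_patch`, `expected_target_log_prob`, and the brightness-based `toy_probs` model are hypothetical stand-ins.

```python
import math
import random

def apply_patch(image, patch, top, left):
    """Return a copy of a 2-D image with the patch pasted at (top, left)."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out

def expected_target_log_prob(prob_fn, images, patch, y_target, samples=50, seed=0):
    """Monte Carlo estimate of E_{x, location}[log P(y_target | A(x, p, location))]."""
    rng = random.Random(seed)
    ph, pw = len(patch), len(patch[0])
    total = 0.0
    for _ in range(samples):
        image = rng.choice(images)
        top = rng.randint(0, len(image) - ph)
        left = rng.randint(0, len(image[0]) - pw)
        probs = prob_fn(apply_patch(image, patch, top, left))
        total += math.log(probs[y_target])
    return total / samples

# Hypothetical two-class model: P(class 1) grows with mean brightness.
def toy_probs(image):
    mean = sum(sum(row) for row in image) / (len(image) * len(image[0]))
    p1 = min(max(mean, 0.01), 0.99)
    return [1.0 - p1, p1]

images = [[[0.0] * 4 for _ in range(4)], [[0.2] * 4 for _ in range(4)]]
bright = [[1.0, 1.0], [1.0, 1.0]]
dark = [[0.0, 0.0], [0.0, 0.0]]
score_bright = expected_target_log_prob(toy_probs, images, bright, y_target=1)
score_dark = expected_target_log_prob(toy_probs, images, dark, y_target=1)
print(score_bright > score_dark)  # True
```

A patch attack would adjust the patch values to push this estimate upward, typically by gradient ascent through the model.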

Backdoor Attack (Trojan Attack)

A training-time, targeted poisoning attack where an adversary implants a trigger pattern into the model during training. The model behaves normally on clean inputs but produces a specific, attacker-chosen output when the trigger is present. The target is the malicious output chosen by the attacker.

Key Mechanism: Poison the training set with examples (x + trigger, y_target). The model learns to associate the trigger with y_target.
Primary Use: Compromising supply chain security (e.g., attacking models trained on third-party data or using pre-trained weights).
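
The poisoning step can be sketched as follows, assuming feature vectors stored as Python lists and a trigger that overwrites the last few features. The trigger shape, the poisoning rate, and the helper names are illustrative assumptions rather than a specific published recipe.

```python
import random

def stamp_trigger(x, trigger_value=1.0, trigger_size=2):
    """Overwrite the last `trigger_size` features with a fixed trigger pattern."""
    return x[:-trigger_size] + [trigger_value] * trigger_size

def poison_dataset(dataset, y_target, rate=0.1, seed=0):
    """Return a copy where a fraction `rate` of examples carry the trigger
    and are relabelled to y_target."""
    rng = random.Random(seed)
    n_poison = int(len(dataset) * rate)
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, (x, y) in enumerate(dataset):
        if i in chosen:
            poisoned.append((stamp_trigger(x), y_target))
        else:
            poisoned.append((list(x), y))
    return poisoned

# Hypothetical clean dataset: 20 examples, 4 features, 3 classes.
clean = [([0.1 * i, 0.2, 0.3, 0.4], i % 3) for i in range(20)]
poisoned = poison_dataset(clean, y_target=2, rate=0.25)
n_triggered = sum(1 for x, _ in poisoned if x[-2:] == [1.0, 1.0])
print(n_triggered)  # prints: 5
```

A model trained on `poisoned` can learn to associate the trigger with `y_target` while its behavior on clean inputs stays largely unchanged.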

ADVERSARIAL ATTACK TYPES

Targeted Attack vs. Untargeted Attack

A comparison of two fundamental adversarial attack strategies, distinguished by the specificity of the attacker's goal.

Feature	Targeted Attack	Untargeted Attack
Primary Objective	Cause the model to output a specific, attacker-chosen incorrect class.	Cause the model to output any incorrect class.
Attack Formulation	Minimizes loss for the target class; some formulations also penalize the true class.	Maximizes loss for the true class.
Adversarial Constraint	More constrained; must find a perturbation that moves the input to a precise region of the output space.	Less constrained; any perturbation that crosses a decision boundary is sufficient.
Typical Difficulty	Generally more difficult and computationally intensive to execute successfully.	Generally easier and less computationally intensive to execute.
Perturbation Magnitude	Often requires larger perturbations to reach a specific, distant target class.	Can often succeed with smaller perturbations to reach the nearest decision boundary.
Success Metric	Targeted success rate: the percentage of inputs misclassified as the specific target class.	Untargeted success rate (attack success rate): the percentage of inputs that are misclassified.
Common Use Cases	Testing for specific, high-consequence failures (e.g., misclassifying a 'stop' sign as a 'speed limit' sign). Red-teaming for precise vulnerabilities.	General robustness evaluation. Stress-testing a model's overall decision boundaries.
Transferability	Lower transferability between models, as the target class decision boundaries are model-specific.	Higher transferability, as decision boundaries for the original class may be similarly vulnerable across models.
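
The two success metrics in the table can be computed side by side from the same set of adversarial predictions. This sketch assumes every target label differs from the corresponding true label; `attack_success_rates` is a hypothetical helper name and the labels are made-up values.

```python
def attack_success_rates(true_labels, adv_predictions, target_labels):
    """Return (targeted_rate, untargeted_rate) for a batch of adversarial examples."""
    n = len(true_labels)
    untargeted = sum(p != y for p, y in zip(adv_predictions, true_labels)) / n
    targeted = sum(p == t for p, t in zip(adv_predictions, target_labels)) / n
    return targeted, untargeted

true_labels = [0, 0, 1, 2]
adv_predictions = [1, 2, 1, 0]   # model outputs on the adversarial inputs
target_labels = [1, 1, 0, 0]     # attacker-chosen targets

print(attack_success_rates(true_labels, adv_predictions, target_labels))  # (0.5, 0.75)
```

Under that assumption the targeted rate can never exceed the untargeted rate, because hitting the target is one particular way of being misclassified.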

TARGETED ATTACK

Frequently Asked Questions

A targeted adversarial attack aims to cause a machine learning model to output a specific, incorrect class chosen by the adversary. This glossary entry addresses common technical questions about how these attacks work, their mechanisms, and defensive strategies.

What is a targeted adversarial attack?

A targeted adversarial attack is a security exploit where an adversary crafts a malicious input, known as an adversarial example, to deliberately cause a machine learning model to misclassify it as a specific, incorrect label chosen by the attacker. Unlike an untargeted attack, which only seeks any incorrect output, a targeted attack aims for a precise, wrong prediction, such as forcing a facial recognition system to identify a person as someone else. This requires more sophisticated perturbation methods, often involving optimization to minimize input distortion while maximizing the probability of the target class.

Related Terms

Targeted attacks are part of a broader adversarial testing landscape. Understanding these related concepts is essential for building robust AI systems.

Untargeted Attack

An adversarial attack where the adversary's sole objective is to cause the model to output any incorrect prediction, without specifying a particular wrong class. This is a broader, less constrained goal than a targeted attack.

Primary Goal: Cause misclassification, not control the specific error.
Evaluation Use: Often used as a baseline for measuring a model's general vulnerability to perturbation.
Example: Modifying an image of a cat just enough so the model calls it a 'dog', 'car', or any other label besides 'cat'.

Evasion Attack

An adversarial attack executed at inference time, where a malicious input is crafted to bypass a deployed model's detection or classification. In their standard form, both targeted and untargeted attacks are evasion attacks.

Phase of Operation: Post-deployment, during model inference.
Contrast with Poisoning: Differs from data poisoning, which occurs during the training phase.
Real-World Context: The most direct threat to production AI systems, such as fooling a content filter or biometric scanner.

Carlini & Wagner Attack (C&W)

A powerful, optimization-based white-box attack method designed to generate adversarial examples with minimal perturbation. It is particularly effective for executing precise targeted attacks.

Methodology: Formulates attack generation as an optimization problem, minimizing perturbation subject to the constraint of causing the target misclassification.
Key Use Case: A standard benchmark for evaluating the strength of adversarial defenses and distillation techniques.
Precision: Excels at finding the smallest possible changes needed to achieve a targeted misclassification.

Projected Gradient Descent (PGD)

A strong, iterative white-box attack and the cornerstone for modern adversarial training. PGD applies the Fast Gradient Sign Method (FGSM) multiple times with a small step size, projecting perturbations back to a valid norm ball after each step.

Strength: Considered a universal first-order adversary; a model robust to PGD is often robust to many other gradient-based attacks under the same threat model.
Role in Training: Adversarial training using PGD-generated examples is a leading defense technique.
Versatility: Can be easily configured for both targeted and untargeted attack objectives.

Transfer Attack

An attack where an adversarial example crafted against one model (the surrogate model) is also effective against a different, potentially black-box, target model. This property enables practical black-box targeted attacks.

Mechanism: Relies on the transferability of adversarial perturbations between models trained on similar data.
Black-Box Application: An adversary can train their own surrogate model, craft a targeted attack against it, and often have it transfer to the unknown target.
Security Implication: Means that even without internal model access, targeted attacks are feasible.

Robust Accuracy

A model's classification accuracy measured on adversarially perturbed test inputs, under a stated threat model. It is a better measure of reliability under attack than standard accuracy, which is measured on clean inputs only.

Key Metric: The primary benchmark for evaluating a model's defense against evasion attacks, including targeted ones.
Calculation: Typically reported as accuracy under attack from a specific threat model (e.g., "PGD robust accuracy").
Trade-off: Often exists between standard accuracy and robust accuracy; improving one can reduce the other.
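
As a minimal sketch, robust accuracy is the accuracy that remains after every test input has been passed through the attack. The threshold classifier and the worst-case shift attack below are toy assumptions; in practice `attack` would be an attack such as PGD run under the stated threat model.

```python
def robust_accuracy(predict, attack, inputs, labels):
    """Fraction of examples still classified correctly after the attack."""
    correct = 0
    for x, y in zip(inputs, labels):
        x_adv = attack(x, y)
        if predict(x_adv) == y:
            correct += 1
    return correct / len(inputs)

# Hypothetical 1-feature classifier with a decision threshold at 0.5.
def predict(x):
    return 1 if x[0] > 0.5 else 0

# Hypothetical worst-case L_inf attack for this model: shift toward the boundary.
def shift_attack(x, y, eps=0.1):
    return [x[0] + eps] if y == 0 else [x[0] - eps]

def no_attack(x, y):
    return x

inputs = [[0.1], [0.45], [0.9], [0.55]]
labels = [0, 0, 1, 1]
print(robust_accuracy(predict, no_attack, inputs, labels))     # 1.0 (standard accuracy)
print(robust_accuracy(predict, shift_attack, inputs, labels))  # 0.5 (robust accuracy)
```

The two inputs that sit within ε of the threshold are the ones that flip, which is why the reported number only has meaning together with the attack and budget used.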

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
