Inferensys

Targeted Attack

A targeted adversarial attack is a security exploit where an adversary crafts an input to force a machine learning model to output a specific, incorrect prediction chosen by the attacker.
ADVERSARIAL TESTING

What is a Targeted Attack?

A targeted adversarial attack is a specific type of security exploit against machine learning models where the attacker aims to produce a pre-selected, incorrect output.

A targeted attack is an adversarial attack where the adversary crafts an input, known as an adversarial example, to cause a machine learning model to output a specific, incorrect class chosen by the attacker. This contrasts with an untargeted attack, which only seeks any misclassification. The attack is considered successful only if the model's prediction matches the attacker's designated target label, making it a more precise and often more challenging objective than simply causing an error.

Executing a targeted attack typically requires finding a small perturbation to a legitimate input that moves it across the model's decision boundary into the region of the target class. Common methods include optimization-based approaches such as the Carlini & Wagner (C&W) attack. The main defense is adversarial training, which can include targeted examples; its effect is measured by the model's robust accuracy under attack. Targeted attacks are a core focus of red-teaming exercises that probe model security.

Key Characteristics of a Targeted Attack

A targeted adversarial attack is defined by the adversary's specific goal of causing a model to output a particular, predetermined incorrect class. This contrasts with untargeted attacks, which aim for any misclassification. The following characteristics distinguish its methodology and objectives.

01

Specificity of Objective

The defining feature of a targeted attack is the adversary's precise goal. Instead of causing any misclassification, the attacker aims to steer the model toward a specific, incorrect output class. For example, an attacker might cause an autonomous vehicle's vision system to classify a stop sign as a 'yield' sign, or a financial fraud detector to label a fraudulent transaction as 'legitimate'. This requires more sophisticated perturbation crafting than untargeted attacks, as the adversarial example must not only cross a decision boundary but also land within a specific, often distant, region of the output space.

02

Higher Attack Difficulty

Targeted attacks are generally more computationally expensive and tend to require larger perturbations than their untargeted counterparts. This is because the optimization problem is more constrained: the adversarial example must make the target class score higher than every other class, including the true class, rather than merely push the true class below any one competitor. Algorithms like the Carlini & Wagner (C&W) attack are explicitly designed for this purpose, formulating it as a minimization problem that trades off perturbation size against a custom loss function that penalizes any output other than the desired target.
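The target-penalizing loss described above can be sketched as a margin over logits. This is a minimal illustration of the C&W-style margin term, not the full attack: the function name `cw_targeted_margin`, the example logits, and the confidence parameter `kappa` are assumptions introduced here, and the perturbation-size term and optimizer are omitted.

```python
import numpy as np

def cw_targeted_margin(logits, target, kappa=0.0):
    """C&W-style targeted margin: max(max_{i != t} Z_i - Z_t, -kappa).

    Positive while some non-target class still has the highest logit;
    reaches -kappa once the target leads by at least kappa.
    """
    logits = np.asarray(logits, dtype=float)
    # Largest logit among all classes except the target.
    best_other = np.delete(logits, target).max()
    return float(max(best_other - logits[target], -kappa))

# Hypothetical logits for a 3-class model.
not_yet_fooled = cw_targeted_margin([2.0, 0.5, -1.0], target=1)
fooled = cw_targeted_margin([0.5, 2.0, -1.0], target=1)
fooled_with_margin = cw_targeted_margin([0.5, 2.0, -1.0], target=1, kappa=1.0)
```

A positive value means the attack has not yet succeeded. Raising `kappa` asks for a more confident targeted misclassification, which usually costs a larger perturbation.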

03

Formal Optimization Problem

Targeted attacks are often framed as a constrained optimization problem. The adversary seeks a small perturbation δ added to a clean input x such that:

  • f(x + δ) = y_target (the model predicts the target class)
  • ||δ||_p ≤ ε (the perturbation is small under a p-norm, e.g., L₂ or L_∞)

This formulation is central to white-box attacks like C&W and Projected Gradient Descent (PGD) when configured for a target. C&W minimizes the perturbation norm directly, while PGD keeps the perturbation inside a fixed ε-ball. In both cases the objective function encodes how far the input is from the target class's decision region, which makes the attack's success measurable and reproducible when benchmarking adversarial robustness.

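The two conditions above are usually combined in one of two ways, written out here as a sketch. The symbols f, x, δ, y_target, ε, and p come from the text; the loss \mathcal{L} (for instance, cross-entropy against the target label) is an assumption added for illustration.

```latex
% Minimum-norm form: find the smallest perturbation that reaches the target
\begin{aligned}
\min_{\delta}\ & \|\delta\|_p \\
\text{s.t.}\ & f(x+\delta) = y_{\text{target}}
\end{aligned}

% Norm-bounded form: get as close to the target as possible within a budget
\begin{aligned}
\min_{\delta}\ & \mathcal{L}\bigl(f(x+\delta),\ y_{\text{target}}\bigr) \\
\text{s.t.}\ & \|\delta\|_p \le \epsilon
\end{aligned}
```

The first form corresponds to minimum-perturbation attacks in the style of C&W; the second corresponds to budgeted attacks in the style of targeted PGD, where success is checked afterwards by testing whether f(x + δ) = y_target.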
04

Use in Security Evaluation

In red-teaming and security audits, targeted attacks are a critical stress test. They simulate a worst-case scenario where an adversary has a concrete, harmful objective. Successfully executing a targeted attack against a model reveals deeper vulnerabilities than an untargeted one. Evaluating a model's robust accuracy against targeted attacks provides a stringent measure of its reliability in high-stakes applications like content moderation (forcing a harmful post to be classified as 'safe') or medical diagnosis (forcing a malignant scan to be classified as 'benign').

05

Connection to Backdoor Attacks

Targeted attacks share conceptual ground with backdoor attacks, but operate at different phases. A backdoor attack is a poisoning attack executed during training, where a model learns to associate a trigger pattern with a specific target label. At inference, any input containing the trigger causes the targeted misclassification. In contrast, a standard targeted attack is an evasion attack at inference time on a clean model. Both aim for a specific incorrect output, making the analysis of a model's susceptibility to targeted perturbations a key part of defending against potential backdoors.

06

Lower Transferability

Adversarial examples crafted for a targeted attack on one model are generally less transferable to other models than those from untargeted attacks. This is because the precise perturbations needed to reach a specific class in one model's high-dimensional feature space are often highly specific to that model's unique weight geometry. While transfer attacks are possible, targeted transferability is a weaker phenomenon, making black-box targeted attacks more challenging and often requiring extensive query-based attack strategies to approximate the target model's decision boundaries.

How Does a Targeted Attack Work?

A targeted adversarial attack is a security exploit where an adversary crafts a malicious input to force a machine learning model to produce a specific, predetermined incorrect output.

A targeted attack is an evasion attack where the adversary's objective is not merely to cause a mistake, but to induce a specific misclassification. For example, an attacker might modify an image of a stop sign so an autonomous vehicle's vision system classifies it as a speed limit sign. This is distinct from an untargeted attack, which only seeks any incorrect output. The attack crafts an adversarial example by applying a subtle, often human-imperceptible, perturbation to a legitimate input, calculated to cross the model's decision boundary toward the chosen target class.

In a white-box attack, the attacker has full access to the model's architecture and parameters and can compute the gradient of a loss with respect to the input, which guides the perturbation toward the target class. Black-box attacks lack gradient access and instead use iterative query-based methods or transfer attacks from a surrogate model. Defenses include adversarial training with targeted examples, complemented by rigorous red-teaming that measures robust accuracy against such specified failures.
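The gradient-guided, white-box procedure described above can be sketched as a targeted, L_∞-bounded PGD loop. This is a toy sketch under stated assumptions, not a reference implementation: the linear softmax classifier, its weights `W` and `b`, the input `x`, and the parameters `eps`, `step`, and `iters` are all hypothetical values chosen so the example runs with NumPy alone.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def targeted_pgd(x, W, b, target, eps=0.5, step=0.05, iters=100):
    """Targeted L_inf PGD against a linear softmax model (logits = W @ x + b)."""
    x_adv = x.copy()
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(iters):
        p = softmax(W @ x_adv + b)
        # Gradient of cross-entropy(target) w.r.t. the input for a linear model.
        grad = W.T @ (p - onehot)
        # Descend the target-class loss; an untargeted attack would instead
        # ascend the true-class loss.
        x_adv = x_adv - step * np.sign(grad)
        # Project back into the eps-ball around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Hypothetical 3-class model over 2-D inputs.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([1.0, 0.2])
target = 1

x_adv = targeted_pgd(x, W, b, target)
clean_pred = int(np.argmax(W @ x + b))
adv_pred = int(np.argmax(W @ x_adv + b))
linf = float(np.max(np.abs(x_adv - x)))
```

In this toy setup the clean input is assigned class 0 and the perturbed input is assigned the target class 1, with every coordinate changed by at most `eps`. Against a real network the same loop would use the framework's automatic differentiation for the gradient, and a black-box attacker would have to estimate it from queries or borrow it from a surrogate model.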

Common Targeted Attack Methods

Targeted attacks are distinguished by the adversary's goal: to force a specific, predetermined misclassification. These methods craft inputs to exploit model vulnerabilities with surgical precision.

ADVERSARIAL ATTACK TYPES

Targeted Attack vs. Untargeted Attack

A comparison of two fundamental adversarial attack strategies, distinguished by the specificity of the attacker's goal.

| Feature | Targeted Attack | Untargeted Attack |
| --- | --- | --- |
| Primary Objective | Cause the model to output a specific, attacker-chosen incorrect class. | Cause the model to output any incorrect class. |
| Attack Formulation | Minimizes loss for the target class while maximizing loss for the true class. | Maximizes loss for the true class. |
| Adversarial Constraint | More constrained; must find a perturbation that moves the input to a precise region of the output space. | Less constrained; any perturbation that crosses a decision boundary is sufficient. |
| Typical Difficulty | Generally more difficult and computationally intensive to execute successfully. | Generally easier and less computationally intensive to execute. |
| Perturbation Magnitude | Often requires larger perturbations to reach a specific, distant target class. | Can often succeed with smaller perturbations to reach the nearest decision boundary. |
| Success Metric | Targeted success rate: the percentage of inputs misclassified as the specific target class. | Untargeted success rate (attack success rate): the percentage of inputs that are misclassified. |
| Common Use Cases | Testing for specific, high-consequence failures (e.g., misclassifying a 'stop' sign as a 'speed limit' sign). Red-teaming for precise vulnerabilities. | General robustness evaluation. Stress-testing a model's overall decision boundaries. |
| Transferability | Lower transferability between models, as the target class decision boundaries are model-specific. | Higher transferability, as decision boundaries for the original class may be similarly vulnerable across models. |
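The two success metrics in the comparison can be computed from the same set of adversarial predictions. The sketch below is illustrative only: the function name `attack_success_rates` and the label arrays are hypothetical, and it assumes every evaluated input was classified correctly before the attack and that inputs whose true class already equals the target are excluded.

```python
import numpy as np

def attack_success_rates(y_true, y_adv_pred, target):
    """Return (targeted_rate, untargeted_rate) over inputs not already in the target class."""
    y_true = np.asarray(y_true)
    y_adv_pred = np.asarray(y_adv_pred)
    # An input whose true label is the target cannot be "misclassified as the target".
    eligible = y_true != target
    # Targeted success: the adversarial prediction equals the chosen target.
    targeted = np.mean(y_adv_pred[eligible] == target)
    # Untargeted success: the adversarial prediction is any wrong label.
    untargeted = np.mean(y_adv_pred[eligible] != y_true[eligible])
    return float(targeted), float(untargeted)

# Hypothetical true labels and post-attack predictions for five inputs.
y_true = [0, 0, 1, 2, 2]
y_adv_pred = [1, 2, 1, 1, 0]
targeted_rate, untargeted_rate = attack_success_rates(y_true, y_adv_pred, target=1)
```

Here all four eligible inputs are misclassified, but only two land on the target class, so the targeted rate (0.5) is lower than the untargeted rate (1.0). On the eligible inputs a targeted success is always also an untargeted success, so the targeted rate can never exceed the untargeted one.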

TARGETED ATTACK

Frequently Asked Questions

A targeted adversarial attack aims to cause a machine learning model to output a specific, incorrect class chosen by the adversary. This glossary entry addresses common technical questions about how these attacks work, their mechanisms, and defensive strategies.

A targeted adversarial attack is a security exploit where an adversary crafts a malicious input, known as an adversarial example, to deliberately cause a machine learning model to misclassify it as a specific, incorrect label chosen by the attacker. Unlike an untargeted attack, which only seeks any incorrect output, a targeted attack aims for a precise, wrong prediction, such as forcing a facial recognition system to identify a person as someone else. This requires more sophisticated perturbation methods, often involving optimization to minimize input distortion while maximizing the probability of the target class.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.