A transfer attack is an adversarial attack where an adversarial example crafted to fool a known surrogate model is also effective against a different, often black-box, target model. This transferability occurs because adversarial perturbations often exploit non-robust, generalizable features learned by many models trained on similar data. The attack is a cornerstone of black-box attack strategies, as it bypasses the need for direct access to the target model's internal parameters or gradients.
Transfer Attack

What is a Transfer Attack?
A transfer attack exploits the shared vulnerabilities between machine learning models, allowing an adversarial example crafted for one model to deceive another.
The efficacy of a transfer attack is a critical measure of a model's adversarial robustness in real-world scenarios. Attackers typically use a white-box attack like Projected Gradient Descent (PGD) on a local surrogate to generate examples, then test them against the target. Defenses focus on reducing this cross-model vulnerability through techniques like adversarial training, which improves robustness by training on perturbed data, making features less transferable.
Key Characteristics of Transfer Attacks
Transfer attacks exploit the shared vulnerabilities between different machine learning models. Understanding their defining properties is crucial for building robust defenses.
Black-Box Exploitation
A transfer attack's primary characteristic is its effectiveness in a black-box setting. The attacker crafts an adversarial example against a local, known surrogate model (often a simpler or open-source model). This example is then transferred to attack a separate, proprietary target model without any knowledge of its internal weights, architecture, or gradients. This makes transfer attacks a practical threat against commercial APIs and closed-source systems.
- Core Mechanism: Relies on the transferability of adversarial perturbations across model decision boundaries.
- Attack Flow: Surrogate Model → Adversarial Example Crafting → Query Target Model.
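The flow above can be sketched with two toy linear classifiers. This is a minimal illustration, not a realistic attack: the weights, the input, and the `epsilon` budget are all made-up values, and real surrogates and targets would be deep networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Return 1 if the linear model scores x as positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def fgsm(weights, bias, x, label, epsilon):
    """One FGSM step against a logistic model: x + epsilon * sign(dLoss/dx)."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    # For binary cross-entropy, the gradient w.r.t. the input is (p - y) * w.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Hypothetical surrogate (attacker-owned) and target (hidden) parameters.
surrogate_w, surrogate_b = [1.0, -0.8, 0.5], 0.0
target_w, target_b = [0.9, -1.1, 0.3], 0.1

x, y = [0.2, -0.1, 0.1], 1

# Step 1: craft the example using only the surrogate's gradients.
x_adv = fgsm(surrogate_w, surrogate_b, x, y, epsilon=0.3)

# Step 2: query the target with the crafted input.
clean_pred = predict(target_w, target_b, x)
adv_pred = predict(target_w, target_b, x_adv)
transferred = clean_pred == y and adv_pred != y
```

The attacker never reads `target_w` when crafting `x_adv`; the target is touched only in the final prediction calls, which stand in for API queries.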
Cross-Model Transferability
The success of a transfer attack hinges on the transferability of adversarial examples. This phenomenon occurs because different models, even with different architectures, often learn similar features and decision boundaries for the same task. Perturbations that exploit non-robust, superficial features in one model are likely to affect another.
Key factors influencing transferability include:
- Model Similarity: Attacks transfer more readily between models of the same family (e.g., different ResNet variants).
- Dataset Similarity: Models trained on similar data distributions share more vulnerabilities.
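- Attack Strength: A stronger white-box attack does not automatically transfer better. Iterative attacks (e.g., PGD) can overfit to the surrogate and sometimes transfer worse than single-step ones (e.g., FGSM), so attackers often use larger perturbation budgets or transfer-oriented variants such as momentum-based methods.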
Surrogate Model Selection
The attacker's choice of surrogate model is a critical strategic decision. The goal is to select or train a model whose decision space closely approximates the unknown target's. Common strategies include:
- Public Model Proxies: Using openly available pre-trained models (e.g., from TensorFlow Hub or PyTorch Hub) as surrogates.
- Model Stealing: First performing a model stealing attack to create a functional copy of the target, then using that copy as the surrogate.
- Ensemble Attacks: Crafting adversarial examples against an ensemble of diverse surrogate models, which often increases transferability by finding perturbations that fool multiple decision boundaries.
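As a rough sketch of the ensemble strategy, the snippet below averages input gradients over several surrogates before taking a single signed step. The three linear surrogates and their weights are hypothetical; averaging gradients is one simple way to combine surrogates, and other schemes (such as averaging logits or losses) are also possible.

```python
import math

def input_gradient(weights, bias, x, label):
    """Gradient of the logistic loss w.r.t. input x for one linear surrogate."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - label) * w for w in weights]

def ensemble_fgsm(surrogates, x, label, epsilon):
    """FGSM step that follows the mean input gradient over all surrogates."""
    grads = [input_gradient(w, b, x, label) for w, b in surrogates]
    mean_grad = [sum(col) / len(grads) for col in zip(*grads)]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, mean_grad)]

# Hypothetical surrogates given as (weights, bias) pairs.
surrogates = [
    ([1.0, -0.8, 0.5], 0.0),
    ([0.7, -1.2, -0.1], 0.05),
    ([1.1, -0.5, 0.4], -0.1),
]

x, y = [0.2, -0.1, 0.1], 1
x_adv = ensemble_fgsm(surrogates, x, y, epsilon=0.3)

# Check how many surrogates the single perturbed input now fools.
fooled = sum(
    1 for w, b in surrogates
    if (sum(wi * xi for wi, xi in zip(w, x_adv)) + b > 0) != (y == 1)
)
```

A perturbation that fools every surrogate at once is less tied to any single decision boundary, which is the intuition behind using ensembles to raise transferability.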
Practical Threat Vector
Transfer attacks represent one of the most realistic adversarial threats to deployed AI systems because they circumvent common defensive assumptions. They are frequently used in security audits and red-teaming exercises to simulate a determined external adversary.
Real-World Implications:
- API Security: Cloud-based vision or language model APIs are vulnerable if they do not employ specific input sanitization or adversarial detection.
- Physical-World Attacks: Many physical adversarial attacks, like malicious stickers on road signs, rely on transferability to work against the unknown vision systems in different autonomous vehicle models.
- Bypassing Gradient Masking: Defenses that rely on gradient masking may fail against transfer attacks, as the attack gradients come from the surrogate, not the defended target.
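The gradient-masking point can be illustrated with a toy target whose output is a hard step. In this sketch (all weights and the budget `epsilon` are assumptions chosen for illustration), a finite-difference estimate of the target's gradient is zero, yet a perturbation derived from a smooth surrogate still flips the target's decision.

```python
def masked_target(x):
    """Toy defended model: a hard step, so its gradient is zero almost everywhere."""
    return 1.0 if 0.9 * x[0] - 1.1 * x[1] + 0.1 > 0 else 0.0

def numerical_gradient(f, x, h=1e-4):
    """Central finite-difference estimate of df/dx."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

x = [0.5, 0.2]

# Attacking the target directly: the estimated gradient carries no signal.
target_grad = numerical_gradient(masked_target, x)

# Attacking via a hypothetical linear surrogate: move against its weights
# to lower the positive-class score.
surrogate_w = [1.0, -0.8]
epsilon = 0.25
x_adv = [xi - epsilon * ((w > 0) - (w < 0)) for xi, w in zip(x, surrogate_w)]

before, after = masked_target(x), masked_target(x_adv)
```

The defended model gives the attacker nothing useful to differentiate, but the surrogate supplies a usable direction, so the masking does not help here.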
Defensive Countermeasures
Defending against transfer attacks requires techniques that fundamentally increase model adversarial robustness and reduce the shared vulnerabilities that enable transferability.
Primary Defenses:
- Adversarial Training: Training the target model on adversarial examples generated against itself (e.g., using PGD) is among the most effective defenses, but it is computationally expensive.
- Input Transformation & Randomization: Applying random resizing, cropping, or bit-depth reduction to inputs can break the carefully crafted adversarial perturbations.
- Gradient Obfuscation Avoidance: Defenses should not rely solely on gradient masking, as this does not stop transfer attacks from a surrogate.
- Ensemble Diversity: Deploying an ensemble of models with intentionally diverse architectures and training regimens can lower the success rate of a single transferred example.
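To make the input-transformation idea concrete, here is a minimal sketch of bit-depth reduction with optional random noise, applied to a flat list of pixel intensities in [0, 1]. The function names, the sample values, and the noise level are illustrative assumptions; production defenses operate on full image tensors and are typically evaluated against adaptive attacks.

```python
import random

def reduce_bit_depth(pixels, bits):
    """Quantize values in [0, 1] to 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]

def randomized_preprocess(pixels, bits=3, noise=0.02, rng=None):
    """Add small uniform noise, clip to [0, 1], then quantize."""
    rng = rng or random.Random()
    noisy = [min(1.0, max(0.0, p + rng.uniform(-noise, noise))) for p in pixels]
    return reduce_bit_depth(noisy, bits)

clean = [0.60, 0.30, 0.70, 1.00]
adversarial = [0.62, 0.28, 0.72, 0.98]  # hypothetical small perturbation

# After quantization, the small perturbation no longer changes the input.
squeezed_clean = reduce_bit_depth(clean, bits=2)
squeezed_adv = reduce_bit_depth(adversarial, bits=2)
perturbation_removed = squeezed_clean == squeezed_adv
```

Coarse quantization only removes perturbations smaller than the gap between levels, and it also discards legitimate detail, which is one reason such defenses tend to be brittle.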
Related Evaluation Concepts
Assessing a model's vulnerability to transfer attacks is a key component of a comprehensive adversarial testing regimen. This involves specific evaluation metrics and practices.
Key Evaluation Metrics:
- Transfer Success Rate: The percentage of adversarial examples crafted on a surrogate model that successfully fool the target model.
- Robust Accuracy: The model's accuracy under attack, measured using a suite of transferred adversarial examples.
Testing Practice: In red-teaming, evaluators assume a black-box posture and use a battery of surrogate models to generate candidate attacks, simulating a real-world threat actor. This provides a more realistic measure of adversarial robustness than white-box evaluations alone.
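The two metrics can be computed as in the sketch below. One convention is assumed here and is not stated in the text above: the transfer success rate counts only inputs that the target classifies correctly before perturbation. The toy threshold classifier and the example triples are invented for illustration.

```python
def transfer_success_rate(target_predict, examples):
    """Share of adversarial examples that flip a correct target prediction.

    examples: list of (clean_input, adversarial_input, true_label).
    Inputs the target already misclassifies when clean are skipped.
    """
    eligible = fooled = 0
    for clean, adv, label in examples:
        if target_predict(clean) != label:
            continue
        eligible += 1
        if target_predict(adv) != label:
            fooled += 1
    return fooled / eligible if eligible else 0.0

def robust_accuracy(target_predict, examples):
    """Accuracy of the target on the transferred adversarial inputs."""
    correct = sum(1 for _, adv, label in examples if target_predict(adv) == label)
    return correct / len(examples) if examples else 0.0

# Toy black-box target: thresholds a scalar feature at 0.5.
def target(x):
    return 1 if x > 0.5 else 0

examples = [
    (0.9, 0.4, 1),
    (0.8, 0.6, 1),
    (0.2, 0.7, 0),
    (0.3, 0.35, 0),
    (0.45, 0.3, 1),  # misclassified even when clean
]

tsr = transfer_success_rate(target, examples)
racc = robust_accuracy(target, examples)
```

Reporting both numbers is useful because they answer different questions: the first describes how well the attack transfers, the second how the deployed model performs under it.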
How a Transfer Attack Works
A transfer attack exploits the shared vulnerabilities between different machine learning models, enabling attacks on systems where the attacker has no internal access.
A transfer attack is an adversarial attack where an example crafted to fool one model, known as the surrogate model, also successfully deceives a different, often inaccessible target model. This property, called transferability, allows attackers to compromise black-box systems by using a local, white-box surrogate to generate malicious inputs. The attack leverages the fact that many models learn similar, non-robust features from data, making them susceptible to the same subtle perturbations.
The attack workflow involves training or querying a surrogate model to craft adversarial examples using white-box methods like the Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD). These crafted inputs are then transferred to the target. This technique is a cornerstone of black-box attack strategies and is critical for red-teaming and evaluating adversarial robustness in production systems where model internals are hidden.
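In symbols, the workflow can be written as follows. The notation is introduced here for illustration and is not from the original text: f_s is the surrogate, f_t the target, L the loss, epsilon the perturbation budget, alpha the PGD step size, and Pi the projection back into the allowed perturbation set.

```latex
% FGSM: a single signed-gradient step computed on the surrogate
x' = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(f_s(x), y)\big)

% PGD: repeated smaller steps, each projected back into the epsilon-ball around x
x^{(k+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}
  \Big( x^{(k)} + \alpha \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(f_s(x^{(k)}), y)\big) \Big)

% The transfer succeeds when the target, never used above, is also fooled
f_t(x') \ne y
```

Only f_s appears in the crafting steps; f_t enters solely in the final check, which is what makes the attack black-box with respect to the target.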
Practical Examples & Attack Vectors
Transfer attacks exploit the shared vulnerabilities between different machine learning models. These examples illustrate how an adversarial example crafted for one model can successfully compromise another, often more secure, target.
Black-Box API Attack via Surrogate Model
An attacker aims to fool a commercial content moderation API (target) that flags toxic text. Without access to the API's model, the attacker:
- Queries the API with diverse text samples to build a dataset of inputs and labels.
- Trains a local BERT-based surrogate model on this collected data.
- Runs a gradient-based white-box attack such as Projected Gradient Descent (PGD) on the surrogate to generate adversarial examples in which toxic content is subtly perturbed (e.g., character swaps, synonyms).
- Submits these adversarial examples to the API. Those that transfer cause the black-box API to misclassify toxic text as safe, bypassing moderation.
- Surrogate: Locally trained BERT model.
- Target: Commercial moderation API (black-box).
- Technique: Query-based model extraction followed by white-box attack on the surrogate.
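The extraction step can be sketched on a toy numeric classifier instead of a text model. Everything here is a stand-in: `query_target` is a hypothetical function that simulates the API with a hidden linear rule, and a perceptron plays the role of the BERT surrogate.

```python
import random

def query_target(x):
    """Stand-in for a black-box API call; the attacker sees only the label."""
    hidden_w, hidden_b = [0.9, -1.1, 0.3], 0.1
    return 1 if sum(w * xi for w, xi in zip(hidden_w, x)) + hidden_b > 0 else 0

def surrogate_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_surrogate(samples, labels, epochs=50, lr=0.1):
    """Fit a perceptron to the labels returned by the target."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - surrogate_predict(w, b, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

rng = random.Random(0)

# Steps 1-2: query the target to label attacker-chosen inputs, then train locally.
queries = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
labels = [query_target(x) for x in queries]
w, b = train_surrogate(queries, labels)

# Measure how closely the surrogate mimics the target on fresh inputs.
holdout = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
agreement = sum(
    surrogate_predict(w, b, x) == query_target(x) for x in holdout
) / len(holdout)
```

High agreement on held-out inputs suggests the surrogate's decision boundary is close to the target's, which is the condition under which white-box attacks on the surrogate are most likely to transfer.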
Transfer Between Model Families for Fraud Detection
A financial institution uses a gradient-boosted tree model (XGBoost) for credit card fraud detection. An attacker studies a publicly available neural network trained on similar transaction data. Using a white-box attack on this neural network surrogate, they generate adversarial transaction features (e.g., slight timing adjustments, amount modifications). Despite the fundamental architectural difference between neural networks and tree-based models, these adversarial examples successfully transfer, causing the production XGBoost model to classify fraudulent transactions as legitimate.
- Surrogate: Neural network (different architecture family).
- Target: XGBoost model in production.
- Implication: Vulnerabilities can transcend model architecture, residing in the data manifold itself.
Offensive Security & Red-Teaming
A security team performs red-teaming on a new large language model (LLM) API before release. They do not have white-box access to the production model. Their process:
- Train a suite of smaller, open-source LLMs (e.g., Llama 2, Mistral) as surrogates.
- Use jailbreaking techniques like GCG (Greedy Coordinate Gradient) or AutoPrompt to generate adversarial suffixes that force the surrogates to produce harmful content.
- Test these adversarial prompts against the black-box target API. Successful transfers reveal critical vulnerabilities that are then patched via adversarial training before public launch.
- Role: Proactive security assessment.
- Surrogates: Open-source LLMs.
- Outcome: Identification of transferable jailbreaks, leading to improved model robustness.
Defensive Implications & The Arms Race
The existence of transfer attacks has profound defensive implications:
- Gradient Masking is Insufficient: Defenses that only obscure gradients (e.g., some forms of defensive distillation) may stop white-box attacks but fail against transfer attacks generated on a surrogate.
- Adversarial Training is Key: Adversarial training with PGD is among the most robust defenses. To harden the model against unknown surrogate-based attacks, it should draw on a diverse set of attack methods and source architectures so that the resulting robustness generalizes.
- Ensemble Robustness: While ensembles of models can improve standard accuracy, they can remain vulnerable to transfer attacks if the individual models share similar decision boundaries. Promoting diversity in robustness among ensemble members is an active research area.
- Core Defense: Adversarial training on transferred examples.
- Challenge: Defending against attacks from an unbounded set of potential surrogate models.
Defensive Strategies Against Transfer Attacks
A comparison of primary defense methodologies used to protect machine learning models from transfer attacks, where adversarial examples crafted on a surrogate model are effective against a black-box target.
| Defensive Strategy | Adversarial Training | Input Preprocessing & Randomization | Gradient Obfuscation & Masking | Model Ensemble & Diversity |
|---|---|---|---|---|
| Core Mechanism | Trains model on adversarial examples generated during training | Applies transformations (e.g., JPEG compression, noise) to inputs at inference | Alters model surface to produce uninformative or shattered gradients | Deploys multiple models with different architectures or training data |
| Primary Defense Goal | Increase intrinsic model robustness to adversarial perturbations | Remove or distort adversarial perturbations before model processing | Obstruct gradient-based attack crafting against the defended model | Reduce transferability by breaking consistency across models |
| Effectiveness Against Transfer Attacks | | | | |
| Robust Accuracy Impact | Increases robust accuracy but may reduce standard accuracy | Minimal impact on standard accuracy; limited robust accuracy gain | Can create false sense of security; often bypassed by adaptive attacks | Increases robust accuracy through collective decision-making |
| Computational Overhead | High (requires generating adversarial examples during training) | Low (adds minimal processing at inference time) | Low to Moderate (modifies forward/backward pass) | High (requires training/maintaining multiple models) |
| Risk of Gradient Masking | | | | |
| Common Techniques | PGD-based training, TRADES | Feature squeezing, randomization, JPEG compression | Defensive distillation, stochastic activation pruning | Bagging, adversarial training with different perturbations |
| Key Limitation | Can overfit to the specific attack used during training | Defenses are often brittle and can be circumvented | Vulnerable to transfer and other black-box attacks that do not need the target's gradients | Increased system complexity and inference cost |
Frequently Asked Questions
A transfer attack exploits the shared vulnerabilities between machine learning models, allowing an adversarial example crafted against one model to deceive another. This FAQ addresses the core mechanisms, implications, and defensive strategies surrounding this critical security phenomenon.
A transfer attack is an adversarial attack where an adversarial example crafted to fool one machine learning model (the surrogate model) is also effective against a different, often unknown, target model. This occurs due to shared, non-robust features learned by different models from similar data distributions. The attack's success hinges on the transferability of adversarial perturbations across model architectures, making it a potent threat against black-box models where internal details are hidden.
Related Terms
Transfer attacks exist within a broader ecosystem of adversarial machine learning concepts. Understanding these related terms is crucial for building a comprehensive security posture.
Black-Box Attack
A black-box attack is executed without access to the target model's internal architecture, parameters, or gradients. The attacker relies solely on the model's input-output behavior, typically via an API. This is the most common real-world attack scenario and the primary context in which transfer attacks are effective, as they allow an attacker to use a locally trained surrogate model to craft examples that transfer to the inaccessible target.
- Key Method: Query-based probing to infer decision boundaries.
- Relation to Transfer Attack: Transfer attacks are a primary strategy for executing practical black-box attacks.
Model Stealing Attack
A model stealing attack (or model extraction attack) is where an adversary uses query access to a target model to reconstruct a functionally equivalent surrogate model. This is often a prerequisite for a successful transfer attack. By training a local copy on inputs and outputs from the black-box target, the attacker creates the very model needed to craft adversarial examples that may transfer.
- Primary Goal: Intellectual property theft and/or acquiring a model for offline analysis.
- Synergy with Transfer Attacks: The stolen surrogate model becomes the platform for generating transferable adversarial examples against the original target.
Adversarial Robustness
Adversarial robustness is the property of a machine learning model that measures its ability to maintain correct predictions when subjected to adversarial attacks. It is the defensive counterpart to offensive techniques like transfer attacks. Robustness is quantitatively measured by robust accuracy—the accuracy on a test set containing adversarial examples.
- Core Challenge: Improving robustness often involves trade-offs with standard accuracy on clean data.
- Defensive Link: Defenses like adversarial training aim to increase robustness against a spectrum of attacks, including those that transfer.
Adversarial Training
Adversarial training is a primary defensive technique that improves a model's robustness by explicitly including adversarial examples in its training dataset. During training, the model learns from both clean data and perturbed data generated by attacks like Projected Gradient Descent (PGD). This process encourages the model to learn smoother, more generalized decision boundaries, which can reduce the success rate of transfer attacks.
- Standard Practice: A cornerstone method for building robust models.
- Effect on Transferability: Models trained with adversarial training often exhibit lower transferability of attacks between them.
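A minimal sketch of the training loop is shown below for a logistic regression model. For brevity it crafts perturbations with a single FGSM step, not the multi-step PGD named above, and the dataset, `epsilon`, learning rate, and epoch count are all illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fgsm(w, b, x, y, epsilon):
    """Perturb x by epsilon in the direction that increases the logistic loss."""
    p = sigmoid(score(w, b, x))
    grad = [(p - y) * wi for wi in w]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def sgd_step(w, b, x, y, lr):
    """One gradient step of logistic regression on a single example."""
    err = sigmoid(score(w, b, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)], b - lr * err

def adversarial_train(data, epsilon, epochs=100, lr=0.1):
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm(w, b, x, y, epsilon)    # attack the current model
            w, b = sgd_step(w, b, x, y, lr)      # learn from the clean input
            w, b = sgd_step(w, b, x_adv, y, lr)  # and from its perturbed copy
    return w, b

# Toy dataset: the class is determined mostly by the first feature.
data = [
    ([1.0, 0.2], 1), ([1.5, -0.3], 1), ([1.2, 0.5], 1),
    ([-1.0, 0.1], 0), ([-1.4, -0.2], 0), ([-1.1, 0.4], 0),
]

w, b = adversarial_train(data, epsilon=0.3)

# Robust accuracy: re-attack the trained model and count surviving predictions.
robust_hits = sum(
    (score(w, b, fgsm(w, b, x, y, 0.3)) > 0) == (y == 1) for x, y in data
)
robust_acc = robust_hits / len(data)
```

Because the perturbed copies are regenerated against the current weights at every step, the model is always trained on the attack that is most relevant to its present decision boundary; this regeneration is also what makes the method expensive.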
Universal Adversarial Perturbation
A universal adversarial perturbation is a single, input-agnostic perturbation vector that, when added to most natural images, causes a model to misclassify them. This phenomenon highlights shared vulnerabilities across a model's data distribution. Crucially, these perturbations can also exhibit transferability across different models, meaning a universal perturbation crafted for one model can often fool another, making them a potent form of transfer attack.
- Key Characteristic: Input-agnostic; one perturbation fools many inputs.
- Transfer Attack Context: Represents a highly efficient and dangerous class of transferable attack.
Red-Teaming
In AI security, red-teaming is the systematic practice of simulating adversarial attacks against a model or system to proactively identify vulnerabilities before deployment. This offensive security exercise encompasses the entire toolkit of attacks, including crafting and testing transfer attacks against production models. The goal is to uncover failure modes, measure robust accuracy, and inform the development of stronger defenses.
- Proactive Security: A critical component of a mature ML security lifecycle.
- Operational Context: Transfer attacks are a key technique used during red-teaming exercises to simulate realistic black-box threat scenarios.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.