Injection points are the predetermined layers or components in a frozen pre-trained model where small, trainable modules like adapters or prefix vectors are integrated. Common locations include after the attention mechanism or the feed-forward network within a transformer block. The strategic selection of these points determines how the new task-specific information flows through and modifies the model's existing representations, enabling efficient adaptation with minimal new parameters.
Injection Points

What are Injection Points?
Injection points are the specific architectural locations within a neural network where parameter-efficient fine-tuning (PEFT) modules are inserted to adapt a model.
For encoder models like BERT, injection points are typically within each transformer layer. In multimodal architectures (e.g., CLIP), points may be placed in both the vision and text encoders, or at their fusion layers. The choice of injection point is a critical hyperparameter, balancing adaptation effectiveness against added computational overhead, and is often informed by the target task and the base model's architecture.
Key Characteristics of Injection Points
Injection points are the specific, architecturally defined locations within a neural network where parameter-efficient modules are inserted to enable task adaptation without full retraining.
Architectural Specificity
Injection points are not arbitrary; they are strategically chosen based on the model's computational graph and information flow. Common locations in transformer architectures include:
- After the multi-head attention module, to modify contextual representations.
- After the feed-forward network (FFN), to adapt processed features.
- Within cross-attention layers in encoder-decoder or multimodal models, to align information between modalities.

Placing modules at these points allows the injected parameters to intercept and transform the activations that carry the most task-relevant information.
Modularity and Non-Destructiveness
A core principle of injection points is non-destructive interference with the frozen backbone. The inserted module (e.g., an adapter) processes the incoming activation and adds its output, creating a residual connection. The operation is typically: y = x + f(x), where x is the original activation and f is the injected module. This ensures the original, general-purpose knowledge encoded in the pre-trained weights remains intact and accessible, while the module learns a task-specific delta.
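The residual form y = x + f(x) can be illustrated with a minimal NumPy sketch. The class name `Adapter`, the tanh approximation of GELU, the initialisation scales, and the zero-initialised up-projection are assumptions chosen for illustration and are not prescribed by the text above; zero-initialising the last layer is simply one way to make the module start as an identity mapping.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; any non-linearity would do)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Adapter:
    """Hypothetical adapter computing y = x + f(x) on frozen activations."""

    def __init__(self, hidden_dim, bottleneck_dim, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(hidden_dim, bottleneck_dim))
        self.b_down = np.zeros(bottleneck_dim)
        # Zero-initialised up-projection, so f(x) = 0 before any training
        self.w_up = np.zeros((bottleneck_dim, hidden_dim))
        self.b_up = np.zeros(hidden_dim)

    def __call__(self, x):
        h = gelu(x @ self.w_down + self.b_down)   # f(x), first half
        delta = h @ self.w_up + self.b_up         # f(x), second half
        return x + delta                          # residual connection

rng = np.random.default_rng(0)
adapter = Adapter(hidden_dim=768, bottleneck_dim=48, rng=rng)
x = rng.normal(size=(2, 5, 768))   # (batch, tokens, hidden) activations
y = adapter(x)
```

With this initialisation `y` equals `x`, so inserting the module does not change the backbone's behaviour until the adapter weights are trained.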
Dimensionality Control (Bottleneck)
To maintain parameter efficiency, modules at injection points almost always employ a bottleneck architecture. For an adapter, this involves:
- A down-projection to a lower-dimensional space (e.g., 64 or 128 dimensions).
- A non-linearity (e.g., GELU).
- An up-projection back to the original feature dimension.
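The bottleneck dimension (or, equivalently, the reduction factor r, the ratio of the hidden size to the bottleneck size) is the critical hyperparameter, trading off adapter capacity against the number of new parameters. A common setting is r=16, which with one adapter per layer adds trainable parameters amounting to roughly 1% of the original model's size or less.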
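The parameter cost of a bottleneck can be worked out directly. The sketch below assumes a BERT-base-like backbone (hidden size 768, 12 layers, roughly 110 million parameters), one adapter per layer, and bias terms on both projections; the function name `adapter_param_count` is hypothetical, and the exact fraction will vary with hidden size, number of injection points, and backbone size.

```python
def adapter_param_count(hidden_dim, reduction_factor):
    """Parameters of one down-projection + up-projection pair, with biases."""
    bottleneck = hidden_dim // reduction_factor
    down = hidden_dim * bottleneck + bottleneck   # weights + bias
    up = bottleneck * hidden_dim + hidden_dim     # weights + bias
    return down + up

hidden_dim = 768
num_layers = 12
backbone_params = 110_000_000   # assumed approximate size of a BERT-base-like model

per_adapter = adapter_param_count(hidden_dim, reduction_factor=16)  # 74,544
total = per_adapter * num_layers                                    # 894,528
fraction = total / backbone_params                                  # about 0.008

print(f"{per_adapter:,} per adapter, {total:,} total, {fraction:.2%} of backbone")
```

Doubling the number of injection points per layer, or halving the reduction factor, doubles the added parameter count.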
Granularity and Coverage
The choice of injection point granularity significantly impacts performance and efficiency.
- Layer-wise Injection: A module is inserted at a chosen point in every transformer block (e.g., after every FFN). This provides comprehensive adaptation but higher parameter cost.
- Selective Injection: Modules are inserted only into a subset of layers, often based on heuristics (e.g., only the top 10 layers, or dropping adapters from the lower layers as in AdapterDrop) or on learned importance. This reduces compute during inference.
- Component-wise Injection: In methods like IA³, lightweight scaling vectors are injected at multiple points within a single layer (e.g., for keys, values, and intermediate FFN activations), offering very fine-grained control.
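Component-wise injection can be sketched as element-wise rescaling in the style of IA³. The example below is a simplified single-head, bias-free illustration in NumPy; the names `l_k`, `l_v`, and `l_ff`, the tiny dimensions, and the ReLU activation are assumptions, and the scaling vectors are initialised to ones so the frozen behaviour is unchanged at the start.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, l_k, l_v):
    # Injection points 1 and 2: rescale keys and values element-wise
    k = k * l_k
    v = v * l_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, w1, w2, l_ff):
    h = np.maximum(x @ w1, 0.0)   # frozen first layer + ReLU
    return (h * l_ff) @ w2        # injection point 3: rescale hidden activations

rng = np.random.default_rng(0)
tokens, d, d_ff = 4, 8, 32
q, k, v = (rng.normal(size=(tokens, d)) for _ in range(3))
w1 = rng.normal(size=(d, d_ff))
w2 = rng.normal(size=(d_ff, d))

# The only trainable parameters: three vectors, initialised to ones
l_k, l_v, l_ff = np.ones(d), np.ones(d), np.ones(d_ff)

attn_out = attention(q, k, v, l_k, l_v)
ffn_out = ffn(attn_out, w1, w2, l_ff)
num_new_params = l_k.size + l_v.size + l_ff.size   # d + d + d_ff
```

Each layer adds only `2 * d + d_ff` parameters in this sketch, far fewer than a bottleneck adapter at the same layer.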
Modality-Specific Considerations
The optimal injection point varies by model architecture and data modality.
- Encoder Models (e.g., BERT): Injections are typically placed within the encoder stack. BERT Adapters commonly insert modules after the FFN layer of each transformer block for NLP understanding tasks.
- Vision Models (e.g., ViT): ViT Adapters may inject at different stages of the visual backbone, sometimes incorporating spatial feature maps for dense prediction tasks like segmentation.
- Multimodal Models (e.g., CLIP, BLIP): VL-Adapters or Cross-Modal Adapters require injection points that affect the fusion mechanisms between modalities, such as within cross-attention layers or at the output of unimodal encoders before fusion.
Impact on Forward Pass Latency
While parameter-efficient, injection points add computational overhead to the model's forward pass. The primary costs are:
- The extra matrix multiplications of the injected module.
- The memory bandwidth required to load the additional parameters and intermediate activations.

Techniques like AdapterDrop explicitly address this by dynamically pruning adapters from lower, less critical layers during inference. The design of the injection point and module (e.g., using a smaller bottleneck dimension) is a direct trade-off between adaptation quality and inference speed.
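The compute side of this trade-off can be estimated with simple arithmetic. The sketch below counts multiply-accumulate operations (MACs) per token and assumes a standard block with four attention projections and a feed-forward expansion of 4x; it ignores the attention score computation, layer normalization, and biases. The function names are hypothetical, and a MAC ratio is only a rough proxy for wall-clock latency, since small sequential modules can cost more in practice than their operation count suggests.

```python
def adapter_macs_per_token(hidden_dim, bottleneck_dim):
    # Down-projection plus up-projection
    return 2 * hidden_dim * bottleneck_dim

def block_macs_per_token(hidden_dim, ffn_mult=4):
    attn = 4 * hidden_dim * hidden_dim            # Q, K, V, and output projections
    ffn = 2 * ffn_mult * hidden_dim * hidden_dim  # two FFN linear layers
    return attn + ffn

def relative_overhead(hidden_dim, bottleneck_dim, num_layers, dropped_layers=0):
    """Extra MACs from adapters relative to the frozen blocks.

    dropped_layers models removing adapters from the lowest layers.
    """
    active = num_layers - dropped_layers
    extra = active * adapter_macs_per_token(hidden_dim, bottleneck_dim)
    base = num_layers * block_macs_per_token(hidden_dim)
    return extra / base

full = relative_overhead(768, 48, num_layers=12)                       # about 1.04%
pruned = relative_overhead(768, 48, num_layers=12, dropped_layers=5)   # about 0.61%
print(f"all layers: {full:.2%}, lower 5 dropped: {pruned:.2%}")
```

Under these assumptions the overhead scales linearly with both the bottleneck dimension and the number of layers that keep an adapter.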
How Injection Points Work in PEFT
Injection points are the predetermined architectural locations within a frozen neural network where parameter-efficient modules are inserted to enable task-specific adaptation.
An injection point is a specific layer or sub-component within a pre-trained model's architecture where a small, trainable PEFT module is integrated. Common points in transformers include the outputs of the multi-head attention mechanism or the feed-forward network. The location is critical, as it determines how the injected module interacts with and transforms the model's internal activations to steer behavior for a new task while the vast majority of original weights remain frozen.
The choice of injection point is a key hyperparameter that balances adaptation efficacy with computational overhead. Inserting modules after attention layers allows direct modulation of contextual representations, while placement after feed-forward networks affects processed features. For multimodal models, injection points may be placed within cross-attention or fusion layers to efficiently adapt interactions between modalities like text and vision.
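The two placements discussed above can be made concrete with a schematic block. This is a sketch that assumes a post-LayerNorm layout (as in BERT) and treats every sub-layer as an opaque callable; `transformer_block`, `adapter_attn`, and `adapter_ffn` are hypothetical names, and the toy scalar functions in the demo stand in for real tensor operations.

```python
def transformer_block(x, attn, ffn, norm1, norm2,
                      adapter_attn=None, adapter_ffn=None):
    """One block with two optional injection points (post-LayerNorm layout)."""
    a = attn(x)
    if adapter_attn is not None:
        a = adapter_attn(a)        # injection point: after attention
    x = norm1(x + a)

    f = ffn(x)
    if adapter_ffn is not None:
        f = adapter_ffn(f)         # injection point: after the FFN
    return norm2(x + f)

# Toy stand-ins so the wiring can be traced by hand
identity = lambda h: h
attn = lambda h: 0.5 * h
ffn = lambda h: 2.0 * h
toy_adapter = lambda h: h + 0.1 * h   # residual form y = h + f(h)

plain = transformer_block(1.0, attn, ffn, identity, identity)
with_ffn_adapter = transformer_block(1.0, attn, ffn, identity, identity,
                                     adapter_ffn=toy_adapter)
```

Passing `None` for an adapter leaves that point untouched, which is how a selective or single-placement configuration would be expressed in this sketch.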
Common Injection Points in Transformer Models
This table compares the primary architectural locations within a transformer model where parameter-efficient modules like adapters or prefixes can be inserted, detailing their typical layer placement, function, and common use cases.
| Injection Point | Typical Layer Placement | Primary Function | Common PEFT Methods | Target Model Types |
|---|---|---|---|---|
| Attention Key/Value Projections | Inside each Multi-Head Attention block | Modulate the attention distribution by transforming the key and value vectors before the attention calculation. | Prefix Tuning, LoRA, IA³ | Decoder-only LLMs, Encoder-Decoder, Encoder-only |
| Attention Output Projection | After the attention concatenation & linear projection | Transform the combined output from all attention heads before the residual connection. | Adapters, LoRA | All Transformer variants |
| Feed-Forward Network (Intermediate) | Between the two linear layers of the FFN (after activation) | Apply a non-linear transformation to the activations within the model's dense block. | Adapters, LoRA, IA³ | All Transformer variants |
| Feed-Forward Network (Output) | After the second FFN linear layer, before residual add | Project the transformed features back to the model's hidden dimension. | Adapters, LoRA | All Transformer variants |
| Layer Normalization Parameters | Within each pre/post LayerNorm module | Re-scale and re-center the activations via trainable gain (γ) and bias (β) parameters. | BitFit (bias-only) | All Transformer variants |
| Cross-Attention Mechanisms | Inside the decoder's cross-attention blocks (Encoder-Decoder) | Modulate how the decoder attends to the encoder's output representations. | Adapters, LoRA, Prefix Tuning | Encoder-Decoder (e.g., T5, BART) |
| Embedding Layer | At the input token or patch embedding projection | Adjust the initial representation of input tokens or image patches. | Prompt Tuning, LoRA on embedding matrix | All Transformer variants |
| Modality-Specific Projections | Within input/output heads of multimodal models | Align features from different modalities (e.g., text, image, audio) into a shared space. | VL-Adapters, Cross-Modal Adapters | Multimodal (e.g., CLIP, BLIP, AudioLM) |
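As an illustration of the first row, the sketch below shows prefix-style injection at the key/value position of a single attention head in NumPy. The function name `attention_with_prefix`, the tiny dimensions, and the random stand-in values are assumptions; in practice the prefix vectors would be the only trained parameters and the projections producing `q`, `k`, and `v` would stay frozen.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(q, k, v, prefix_k=None, prefix_v=None):
    """Single-head attention; optional trainable prefixes extend keys and values."""
    if prefix_k is not None:
        k = np.concatenate([prefix_k, k], axis=0)   # (prefix_len + tokens, d)
        v = np.concatenate([prefix_v, v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v                      # still (tokens, d)

rng = np.random.default_rng(0)
tokens, d, prefix_len = 5, 8, 3
q, k, v = (rng.normal(size=(tokens, d)) for _ in range(3))

# Stand-ins for learned prefix vectors (random here, trained in practice)
prefix_k = rng.normal(size=(prefix_len, d))
prefix_v = rng.normal(size=(prefix_len, d))

baseline = attention_with_prefix(q, k, v)
adapted = attention_with_prefix(q, k, v, prefix_k, prefix_v)
```

The output keeps the shape of the original activations, so nothing downstream of the injection point needs to change.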
Frequently Asked Questions
Injection points are the architectural locations within a neural network where parameter-efficient modules are inserted to enable task-specific adaptation without retraining the entire model. This FAQ addresses common technical questions about their function, selection, and impact.
An injection point is a specific, pre-defined location within a neural network's architecture where a small, trainable parameter-efficient fine-tuning (PEFT) module is inserted to adapt the model's behavior for a new task. These points are strategically chosen layers—such as after the multi-head attention block or the feed-forward network in a transformer—where the model's intermediate representations (activations) can be most effectively transformed. By inserting modules like adapters, prefixes, or LoRA matrices only at these points, the vast majority of the pre-trained model's weights remain frozen, enabling efficient adaptation with a tiny fraction of the original parameters.
Related Terms
Injection points are the architectural locations where PEFT modules are integrated. The following terms define the specific modules inserted, the models they target, and the broader adaptation paradigm.
Adapter
A small, trainable neural network module inserted at specific injection points within a frozen pre-trained model. It learns task-specific transformations of the intermediate activations flowing through the network.
- Typically consists of a down-projection, non-linearity, and up-projection.
- The bottleneck dimension controls its capacity and parameter efficiency.
- Enables efficient adaptation by updating only the adapter parameters, leaving the original model weights frozen.
Frozen Backbone
The large, pre-trained base model (e.g., BERT, ViT, CLIP) whose parameters are kept fixed and non-trainable during parameter-efficient fine-tuning.
- Serves as a stable feature extractor with rich prior knowledge.
- All adaptation occurs via small, added modules (like adapters or prefixes) at the injection points.
- This is the core premise of PEFT: preserving the backbone's general capabilities while learning task-specific behaviors efficiently.
Delta Weights
The small set of learned parameter changes (Δ) that represent the task-specific adaptation applied to a frozen pre-trained model during PEFT.
- In methods like LoRA, these are the low-rank matrices added to the original weights.
- The collection of all delta weights from a fine-tuning run forms a Task Vector.
- This conceptualization enables powerful techniques like model merging, where deltas from multiple tasks are combined arithmetically.
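The arithmetic behind delta weights can be sketched with a LoRA-style low-rank update in NumPy. The function name `lora_delta`, the scaling `alpha / rank`, the equal-weight merge, and the random matrices standing in for trained factors are all assumptions for illustration; real merging schemes may weight or filter the deltas differently.

```python
import numpy as np

def lora_delta(b, a, alpha, rank):
    """Low-rank weight update: delta = (alpha / rank) * B @ A."""
    return (alpha / rank) * (b @ a)

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 16, 16, 2, 4.0
w_frozen = rng.normal(size=(d_out, d_in))   # stand-in for a frozen projection

# Random stand-ins for the trained factors of two hypothetical tasks
b1, a1 = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))
b2, a2 = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))

delta_1 = lora_delta(b1, a1, alpha, rank)
delta_2 = lora_delta(b2, a2, alpha, rank)

# Applying one task's delta, and an equal-weight merge of both tasks
w_task1 = w_frozen + delta_1
w_merged = w_frozen + 0.5 * (delta_1 + delta_2)

x = rng.normal(size=(3, d_in))
y_task1 = x @ w_task1.T   # same result as adding the low-rank path to x @ w_frozen.T
```

Because the deltas are stored separately from `w_frozen`, they can be added, scaled, or removed without touching the backbone weights.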
Encoder PEFT
The application of parameter-efficient fine-tuning techniques specifically to encoder-only transformer models like BERT, RoBERTa, or DeBERTa.
- Injection points are typically after the feed-forward network or attention output within each transformer layer.
- Common tasks include text classification, named entity recognition (NER), and question answering.
- BERT Adapters are a canonical example, demonstrating efficient adaptation for natural language understanding.
Multimodal Fusion PEFT
Using parameter-efficient methods to adapt the fusion mechanisms in pre-trained multimodal models (e.g., CLIP, BLIP).
- Injection points target layers responsible for aligning and combining information from different modalities (text, image, audio).
- VL-Adapters and Cross-Modal Adapters are modules designed for these fusion points.
- Enables efficient adaptation to downstream tasks like visual question answering (VQA) or image captioning without retraining the entire complex model.
Bottleneck Dimension
A key hyperparameter in adapter-based PEFT that defines the size of the adapter's hidden layer, controlling its capacity and parameter count.
- It creates a computational bottleneck, first projecting activations down to this dimension and then back up.
- A smaller bottleneck dimension increases parameter efficiency but may reduce adapter capacity.
- It is often set via a reduction factor (e.g., reducing 768-dim activations to a bottleneck of 48, a factor of 16).

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.