Injection Points

Injection points are the specific architectural locations within a neural network where parameter-efficient fine-tuning (PEFT) modules, such as adapters or prefixes, are inserted to enable efficient model adaptation.
PARAMETER-EFFICIENT FINE-TUNING

What are Injection Points?

Injection points are the specific architectural locations within a neural network where parameter-efficient fine-tuning (PEFT) modules are inserted to adapt a model.

Injection points are the predetermined layers or components in a frozen pre-trained model where small, trainable modules like adapters or prefix vectors are integrated. Common locations include after the attention mechanism or the feed-forward network within a transformer block. The strategic selection of these points determines how the new task-specific information flows through and modifies the model's existing representations, enabling efficient adaptation with minimal new parameters.

For encoder models like BERT, injection points are typically within each transformer layer. In multimodal architectures (e.g., CLIP), points may be placed in both the vision and text encoders, or at their fusion layers. The choice of injection point is a critical hyperparameter, balancing adaptation effectiveness against added computational overhead, and is often informed by the target task and the base model's architecture.

ARCHITECTURAL CONCEPT

Key Characteristics of Injection Points

Injection points are the specific, architecturally defined locations within a neural network where parameter-efficient modules are inserted to enable task adaptation without full retraining.

01

Architectural Specificity

Injection points are not arbitrary; they are strategically chosen based on the model's computational graph and information flow. Common locations in transformer architectures include:

  • After the multi-head attention module, to modify contextual representations.
  • After the feed-forward network (FFN), to adapt processed features.
  • Within cross-attention layers in encoder-decoder or multimodal models, to align information between modalities.

Placing modules at these points allows the injected parameters to intercept and transform the activations that carry the most task-relevant information.
02

Modularity and Non-Destructiveness

A core principle of injection points is non-destructive interference with the frozen backbone. The inserted module (e.g., an adapter) processes the incoming activation and adds its output, creating a residual connection. The operation is typically: y = x + f(x), where x is the original activation and f is the injected module. This ensures the original, general-purpose knowledge encoded in the pre-trained weights remains intact and accessible, while the module learns a task-specific delta.
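
The residual form y = x + f(x) can be sketched in plain Python. This is a minimal illustration rather than any library's implementation: the class name `BottleneckAdapter`, the helper `matvec`, the toy dimensions, and the zero-initialised up-projection are all assumptions made for the example.

```python
import math
import random


def gelu(v):
    """Tanh approximation of GELU, applied to a single float."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Hypothetical adapter f: down-project, non-linearity, up-project."""

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights (bottleneck_dim x hidden_dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
                     for _ in range(bottleneck_dim)]
        # Up-projection starts at zero (an assumption), so f(x) = 0 initially.
        self.up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]

    def delta(self, x):
        """f(x): the task-specific delta produced by the module."""
        hidden = [gelu(h) for h in matvec(self.down, x)]
        return matvec(self.up, hidden)

    def __call__(self, x):
        """y = x + f(x): the frozen activation plus the adapter's delta."""
        return [xi + di for xi, di in zip(x, self.delta(x))]


x = [0.5, -1.0, 2.0, 0.25]          # activation from the frozen backbone
adapter = BottleneckAdapter(hidden_dim=4, bottleneck_dim=2)
y_initial = adapter(x)              # equals x while the up-projection is zero

adapter.up[0][0] = 1.0              # stand-in for a training update
y_updated = adapter(x)              # only the first coordinate can change
```

With a zero up-projection the adapter is an exact identity, so inserting it does not disturb the backbone's behaviour before training. Any change in the output comes only from the module's own weights.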

03

Dimensionality Control (Bottleneck)

To maintain parameter efficiency, modules at injection points almost always employ a bottleneck architecture. For an adapter, this involves:

  • A down-projection to a lower-dimensional space (e.g., 64 or 128 dimensions).
  • A non-linearity (e.g., GELU).
  • An up-projection back to the original feature dimension.

The reduction factor r (the ratio of the hidden size to the bottleneck dimension) is the critical hyperparameter, trading off adapter capacity against the number of new parameters. A common setting is r=16, which keeps the added parameters to a small fraction of the original model's size (on the order of 0.5%).
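
To make the trade-off concrete, the following sketch counts the parameters of a bottleneck adapter for several reduction factors. The hidden size of 768, the 12 layers, and the 110 million backbone parameters are illustrative assumptions (roughly BERT-base scale), not figures from this article, and the function name `adapter_param_count` is hypothetical.

```python
def adapter_param_count(hidden_dim, reduction_factor):
    """Parameters in one adapter: two linear layers, each with a bias."""
    bottleneck_dim = hidden_dim // reduction_factor
    down = hidden_dim * bottleneck_dim + bottleneck_dim   # weights + bias
    up = bottleneck_dim * hidden_dim + hidden_dim         # weights + bias
    return down + up


HIDDEN_DIM = 768                 # assumed hidden size
NUM_LAYERS = 12                  # assumed number of transformer blocks
BACKBONE_PARAMS = 110_000_000    # assumed frozen parameter count

for r in (4, 16, 64):
    per_adapter = adapter_param_count(HIDDEN_DIM, r)
    total = per_adapter * NUM_LAYERS          # one adapter per block
    share = 100.0 * total / BACKBONE_PARAMS
    print(f"r={r:>2}: bottleneck={HIDDEN_DIM // r:>3}, "
          f"added={total:,} ({share:.2f}% of backbone)")
```

Under these assumptions, r=16 gives a 48-dimensional bottleneck and 894,528 added parameters, a little under 1% of the backbone. The exact share depends on the backbone and on how many injection points receive a module.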
04

Granularity and Coverage

The choice of injection point granularity significantly impacts performance and efficiency.

  • Layer-wise Injection: A module is inserted at a chosen point in every transformer block (e.g., after every FFN). This provides comprehensive adaptation but higher parameter cost.
  • Selective Injection: Modules are inserted only into a subset of layers, chosen by heuristics (e.g., only the top 10 layers) or by learned importance. AdapterDrop, for example, removes adapters from the lower layers. This reduces compute during inference.
  • Component-wise Injection: In methods like IA³, lightweight scaling vectors are injected at multiple points within a single layer (e.g., on the keys, the values, and the intermediate FFN activations), offering very fine-grained control.
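
The three granularities can be sketched as follows. The function names, the strategy strings, and the `top_k` parameter are hypothetical and chosen only for illustration; the scaling function mirrors the element-wise rescaling used by IA³-style methods, with the vector initialised to ones so that activations initially pass through unchanged.

```python
def select_injection_layers(num_layers, strategy="layer_wise", top_k=None):
    """Return the indices of transformer blocks that receive a module."""
    if strategy == "layer_wise":
        return list(range(num_layers))
    if strategy == "selective":
        if top_k is None or not 0 < top_k <= num_layers:
            raise ValueError("selective injection needs 0 < top_k <= num_layers")
        # Keep only the top (last) top_k blocks.
        return list(range(num_layers - top_k, num_layers))
    raise ValueError(f"unknown strategy: {strategy}")


def scale_activation(activation, scaling_vector):
    """Component-wise injection: rescale each feature by a learned factor."""
    return [a * s for a, s in zip(activation, scaling_vector)]


every_block = select_injection_layers(12)                       # blocks 0..11
top_blocks = select_injection_layers(12, "selective", top_k=4)  # blocks 8..11

keys = [0.2, -0.4, 1.0]
key_scale = [1.0, 1.0, 1.0]      # initialised to ones: keys pass through unchanged
unchanged = scale_activation(keys, key_scale)
```

Selective injection cuts the module count in proportion to the number of skipped blocks. Component-wise scaling adds only one vector per targeted activation.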
05

Modality-Specific Considerations

The optimal injection point varies by model architecture and data modality.

  • Encoder Models (e.g., BERT): Injections are typically placed within the encoder stack. BERT Adapters commonly insert modules after the FFN layer of each transformer block for NLP understanding tasks.
  • Vision Models (e.g., ViT): ViT Adapters may inject at different stages of the visual backbone, sometimes incorporating spatial feature maps for dense prediction tasks like segmentation.
  • Multimodal Models (e.g., CLIP, BLIP): VL-Adapters or Cross-Modal Adapters require injection points that affect the fusion mechanisms between modalities, such as within cross-attention layers or at the output of unimodal encoders before fusion.
06

Impact on Forward Pass Latency

While parameter-efficient, injection points add computational overhead to the model's forward pass. The primary costs are:

  • The extra matrix multiplications of the injected module.
  • The memory bandwidth required to load the additional parameters and intermediate activations.

Techniques like AdapterDrop explicitly address this by dynamically pruning adapters from lower, less critical layers during inference. The design of the injection point and module (e.g., using a smaller bottleneck dimension) is a direct trade-off between adaptation quality and inference speed.
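
A rough way to reason about the first cost is to compare multiply-accumulate (MAC) counts per token. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts only the dense projections of a standard block (four attention projections and an FFN with a 4x inner dimension), ignores the sequence-length-dependent attention scores, and assumes one adapter per block. All function names and dimensions are hypothetical.

```python
def block_macs_per_token(hidden_dim, ffn_multiplier=4):
    """MACs in one block's dense projections, per token.

    Counts the four attention projections and the two FFN layers only;
    the sequence-length-dependent attention scores are ignored.
    """
    attention = 4 * hidden_dim * hidden_dim
    ffn = 2 * hidden_dim * (ffn_multiplier * hidden_dim)
    return attention + ffn


def adapter_macs_per_token(hidden_dim, bottleneck_dim):
    """MACs of one down-projection plus one up-projection."""
    return 2 * hidden_dim * bottleneck_dim


def relative_overhead(hidden_dim, bottleneck_dim, num_layers, dropped_lower_layers=0):
    """Adapter MACs as a fraction of backbone MACs, one adapter per kept block."""
    kept = num_layers - dropped_lower_layers
    backbone = num_layers * block_macs_per_token(hidden_dim)
    adapters = kept * adapter_macs_per_token(hidden_dim, bottleneck_dim)
    return adapters / backbone


full = relative_overhead(768, 48, num_layers=12)
pruned = relative_overhead(768, 48, num_layers=12, dropped_lower_layers=5)
print(f"all adapters: {full:.2%}, lowest 5 removed: {pruned:.2%}")
```

With these numbers the adapters add about 1% extra MACs, falling to about 0.6% when the five lowest adapters are removed. MAC counts are only a proxy: measured latency also depends on the memory-bandwidth cost listed above.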
ARCHITECTURAL MECHANISM

How Injection Points Work in PEFT

Injection points are the predetermined architectural locations within a frozen neural network where parameter-efficient modules are inserted to enable task-specific adaptation.

An injection point is a specific layer or sub-component within a pre-trained model's architecture where a small, trainable PEFT module is integrated. Common points in transformers include the outputs of the multi-head attention mechanism or the feed-forward network. The location is critical, as it determines how the injected module interacts with and transforms the model's internal activations to steer behavior for a new task while the vast majority of original weights remain frozen.

The choice of injection point is a key hyperparameter that balances adaptation efficacy with computational overhead. Inserting modules after attention layers allows direct modulation of contextual representations, while placement after feed-forward networks affects processed features. For multimodal models, injection points may be placed within cross-attention or fusion layers to efficiently adapt interactions between modalities like text and vision.
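
The mechanism can be illustrated with a toy block that exposes two named injection points. This is a sketch under simplifying assumptions: the class `FrozenBlock`, the point names `attn_out` and `ffn_out`, and the stand-in sub-layers are hypothetical, and normalisation layers are omitted for brevity.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


class FrozenBlock:
    """Toy stand-in for a transformer block with two named injection points."""

    INJECTION_POINTS = ("attn_out", "ffn_out")

    def __init__(self, attention, feed_forward):
        self.attention = attention        # frozen sub-layer (callable)
        self.feed_forward = feed_forward  # frozen sub-layer (callable)
        self.injected = {}                # point name -> trainable module

    def inject(self, point, module):
        if point not in self.INJECTION_POINTS:
            raise KeyError(f"unknown injection point: {point}")
        self.injected[point] = module

    def _apply(self, point, activation):
        module = self.injected.get(point)
        if module is None:
            return activation                         # nothing injected here
        return add(activation, module(activation))    # y = x + f(x)

    def forward(self, x):
        # Each sub-layer output passes its injection point before the residual add.
        h = add(x, self._apply("attn_out", self.attention(x)))
        return add(h, self._apply("ffn_out", self.feed_forward(h)))


# Frozen sub-layers are faked with fixed functions for illustration.
block = FrozenBlock(attention=lambda v: [0.5 * t for t in v],
                    feed_forward=lambda v: [t + 1.0 for t in v])

baseline = block.forward([1.0, 2.0])
block.inject("ffn_out", lambda v: [0.1 * t for t in v])  # hypothetical adapter
adapted = block.forward([1.0, 2.0])
```

The frozen sub-layers are never modified. Changing where `inject` is called is what changes which activations the trainable module sees.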

ARCHITECTURAL LOCATIONS

Common Injection Points in Transformer Models

This table compares the primary architectural locations within a transformer model where parameter-efficient modules like adapters or prefixes can be inserted, detailing their typical layer placement, function, and common use cases.

| Injection Point | Typical Layer Placement | Primary Function | Common PEFT Methods | Target Model Types |
| --- | --- | --- | --- | --- |
| Attention Key/Value Projections | Inside each Multi-Head Attention block | Modulate the attention distribution by transforming the key and value vectors before the attention calculation. | Prefix Tuning, LoRA, IA³ | Decoder-only LLMs, Encoder-Decoder, Encoder-only |
| Attention Output Projection | After the attention concatenation & linear projection | Transform the combined output from all attention heads before the residual connection. | Adapters, LoRA | All Transformer variants |
| Feed-Forward Network (Intermediate) | Between the two linear layers of the FFN (after activation) | Apply a non-linear transformation to the activations within the model's dense block. | Adapters, LoRA, IA³ | All Transformer variants |
| Feed-Forward Network (Output) | After the second FFN linear layer, before residual add | Project the transformed features back to the model's hidden dimension. | Adapters, LoRA | All Transformer variants |
| Layer Normalization Parameters | Within each pre/post LayerNorm module | Re-scale and re-center the activations via trainable gain (γ) and bias (β) parameters. | BitFit (bias-only) | All Transformer variants |
| Cross-Attention Mechanisms | Inside the decoder's cross-attention blocks (Encoder-Decoder) | Modulate how the decoder attends to the encoder's output representations. | Adapters, LoRA, Prefix Tuning | Encoder-Decoder (e.g., T5, BART) |
| Embedding Layer | At the input token or patch embedding projection | Adjust the initial representation of input tokens or image patches. | Prompt Tuning, LoRA on embedding matrix | All Transformer variants |
| Modality-Specific Projections | Within input/output heads of multimodal models | Align features from different modalities (e.g., text, image, audio) into a shared space. | VL-Adapters, Cross-Modal Adapters | Multimodal (e.g., CLIP, BLIP, AudioLM) |
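
Several rows above list LoRA, which injects at a projection's weight matrix rather than at its output. As a hedged sketch, the standard LoRA formulation adds a low-rank product to the frozen weight, W + (alpha / r) · B · A, with B initialised to zero. The helper names and toy matrices below are assumptions for illustration only.

```python
def matmul(a, b):
    """Multiply nested-list matrices a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_effective_weight(frozen_weight, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A without modifying the frozen W."""
    rank = len(lora_a)                       # A has shape (r x in_features)
    scale = alpha / rank
    update = matmul(lora_b, lora_a)          # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(frozen_weight, update)]


W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]                # frozen projection, 2 x 3
A = [[0.5, -0.5, 1.0]]               # trainable, rank 1: 1 x 3
B_init = [[0.0], [0.0]]              # trainable, zero at start: 2 x 1
B_trained = [[2.0], [0.0]]           # stand-in for values after training

W_start = lora_effective_weight(W, B_init, A, alpha=1.0)    # equals W
W_tuned = lora_effective_weight(W, B_trained, A, alpha=1.0)
```

Because B starts at zero, the effective weight equals the frozen weight before training, the same non-destructive property that residual adapters have. After training, the low-rank update can be merged into W, so this injection point need not add inference-time layers.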

INJECTION POINTS

Frequently Asked Questions

Injection points are the architectural locations within a neural network where parameter-efficient modules are inserted to enable task-specific adaptation without retraining the entire model. This FAQ addresses common technical questions about their function, selection, and impact.

An injection point is a specific, pre-defined location within a neural network's architecture where a small, trainable parameter-efficient fine-tuning (PEFT) module is inserted to adapt the model's behavior for a new task. These points are strategically chosen layers—such as after the multi-head attention block or the feed-forward network in a transformer—where the model's intermediate representations (activations) can be most effectively transformed. By inserting modules like adapters, prefixes, or LoRA matrices only at these points, the vast majority of the pre-trained model's weights remain frozen, enabling efficient adaptation with a tiny fraction of the original parameters.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.