Glossary

Model Merging (PEFT)

Model merging in PEFT is the process of combining delta weights or task vectors from multiple fine-tuned models into a single model to achieve multi-task capabilities or improved generalization.

PARAMETER-EFFICIENT FINE-TUNING

What is Model Merging (PEFT)?

Model merging in Parameter-Efficient Fine-Tuning (PEFT) is the process of combining the learned parameter changes (delta weights) from multiple independently fine-tuned models into a single, unified model to achieve multi-task capabilities or enhanced generalization.

In PEFT, each specialized model is created by training a small set of parameters—like Low-Rank Adaptation (LoRA) matrices or adapter modules—on top of a frozen base model. The resulting task vectors (the arithmetic difference between the fine-tuned and base weights) encode distinct capabilities. Model merging performs arithmetic operations, such as linear interpolation or task arithmetic, on these vectors to combine their knowledge into one model without catastrophic interference, enabling a single model to perform multiple tasks efficiently.

This technique is foundational for building multi-task models and improving cross-task generalization without the cost of training a separate full model for each task. It leverages the modular nature of PEFT methods: deltas trained on the same frozen base can be added together, and they combine most cleanly when they modify largely non-overlapping directions in weight space. Where they overlap or conflict, methods such as TIES-Merging and DARE reduce the interference. The merged model retains the efficiency of the original PEFT approach, since only a single, consolidated set of delta parameters must be stored and served alongside the frozen backbone.

Core Mechanisms of Model Merging

The following methods are the most common ways to combine delta weights or task vectors. They differ mainly in how they handle conflicts between the models being merged.

Task Vector Arithmetic

The foundational operation for model merging. A task vector is calculated as the arithmetic difference between a fine-tuned model's weights and the original pre-trained base model's weights (Δ = W_finetuned - W_base). Merging involves performing linear operations on these vectors.

Averaging: Combining vectors from similar tasks (Δ_merged = (Δ_A + Δ_B) / 2) to improve robustness.
Interpolation: Creating a weighted sum (Δ_merged = α * Δ_A + (1-α) * Δ_B) to balance task performance.
Negation: Subtracting a vector (W_new = W_base - Δ) to potentially remove undesired behaviors or "unlearn" a task.
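
The three operations above can be sketched on toy data. This is a minimal illustration, assuming each model is a single flattened NumPy array; the weight values and the helper name `apply_delta` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Toy flattened weights; a real model holds one such array per layer.
w_base = np.array([1.0, 2.0, 3.0])
w_task_a = np.array([1.5, 2.0, 2.0])   # hypothetical fine-tune on task A
w_task_b = np.array([1.0, 3.0, 3.5])   # hypothetical fine-tune on task B

# Task vectors: delta = W_finetuned - W_base
delta_a = w_task_a - w_base
delta_b = w_task_b - w_base

def apply_delta(base, delta):
    """Add a (possibly merged) task vector back onto the base weights."""
    return base + delta

averaged = apply_delta(w_base, (delta_a + delta_b) / 2)
alpha = 0.25
interpolated = apply_delta(w_base, alpha * delta_a + (1 - alpha) * delta_b)
negated = apply_delta(w_base, -delta_a)   # W_new = W_base - delta
```

In every case the base weights stay untouched and only the combined delta changes, which is why the same base can be reused for any number of merges.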

TIES-Merging (TrIm, Elect Sign & Merge)

A method that addresses interference caused by redundant parameters and by conflicting parameter signs across task vectors. It performs three steps:

TrIm: Retains only the top-k% of parameters by magnitude in each task vector and resets the rest to zero, sparsifying the updates.
Elect Sign: For each parameter, resolves sign conflicts by electing the sign with the larger total magnitude across all task vectors.
Disjoint Merge: Averages only the parameter values that agree with the elected sign, reducing destructive interference.

The method was reported to outperform simple averaging and plain task arithmetic, particularly as the number of merged models grows.
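
A compact sketch of the three steps may help make them concrete. It assumes flattened one-dimensional task vectors and elects each sign from the summed trimmed values, so the direction with the larger total magnitude wins. The function name `ties_merge` and the `keep_fraction` parameter are illustrative, not taken from any library.

```python
import numpy as np

def ties_merge(task_vectors, keep_fraction=0.2):
    """Merge 1-D task vectors with trim, elect-sign, and disjoint-mean steps."""
    stacked = np.stack([np.asarray(tv, dtype=float) for tv in task_vectors])
    trimmed = np.zeros_like(stacked)

    # Trim: keep only the largest-magnitude entries of each task vector.
    k = max(1, int(round(keep_fraction * stacked.shape[1])))
    for i, tv in enumerate(stacked):
        top = np.argsort(np.abs(tv))[-k:]
        trimmed[i, top] = tv[top]

    # Elect sign: per parameter, the sign of the summed trimmed values,
    # i.e. the direction that carries the larger total magnitude.
    elected = np.sign(trimmed.sum(axis=0))

    # Disjoint merge: average only nonzero entries that agree with the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = agree.sum(axis=0)
    total = np.where(agree, trimmed, 0.0).sum(axis=0)
    return np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)
```

The returned vector is a merged delta. In practice it would still be scaled by a tunable coefficient and added to the base weights.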

DARE (Drop And REscale)

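A sparsification technique applied to delta weights before merging. It was introduced for fully fine-tuned language models and can also be applied to the dense deltas produced by Low-Rank Adaptation (LoRA). It exploits the high redundancy in fine-tuning deltas.

Random Drop: A large fraction p of delta weights (e.g., 90%) is randomly set to zero.
Rescaling: The remaining weights are multiplied by 1 / (1 - p) (10x for p = 0.9), which preserves the expected value of each delta entry.
Merging: The sparsified and rescaled deltas are then combined with an existing method such as averaging, task arithmetic, or TIES-Merging.

Because the dropped parameters are largely redundant, DARE can often remove most delta weights with little loss in task performance, which reduces the overlap between the deltas being merged. It tends to work best when the deltas are small relative to the base weights.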
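
The drop and rescale steps fit in a few lines. The sketch below is an illustration under stated assumptions: the delta is a NumPy array, each entry is dropped independently with probability `drop_rate`, and survivors are scaled by 1 / (1 - drop_rate) so that the expected value of each entry is unchanged. The function name `dare` is hypothetical.

```python
import numpy as np

def dare(delta, drop_rate=0.9, rng=None):
    """Randomly zero a fraction of delta entries and rescale the survivors."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = np.random.default_rng() if rng is None else rng
    delta = np.asarray(delta, dtype=float)
    keep_mask = rng.random(delta.shape) >= drop_rate   # keep with probability 1 - p
    # Rescaling by 1 / (1 - p) keeps the expected value of each entry unchanged.
    return np.where(keep_mask, delta / (1.0 - drop_rate), 0.0)
```

The output is still a single model's delta. Merging happens afterwards, by passing several such sparsified deltas to averaging, task arithmetic, or a TIES-style merge.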

Slerp (Spherical Linear Interpolation)

An interpolation technique used when merging models, preferred over linear interpolation for certain parameter spaces. It interpolates along the geodesic (shortest path) on a hypersphere, treating weight sets as vectors.

Use Case: Particularly effective for merging models whose fine-tuned weights have similar magnitudes but different directions in the high-dimensional parameter space.
Process: Given two model weight vectors A and B separated by angle θ, Slerp computes sin((1-t)θ)/sin(θ) * A + sin(tθ)/sin(θ) * B for an interpolation factor t in [0, 1]. For vectors of equal magnitude the result keeps that magnitude, whereas linear interpolation (Lerp) shrinks it.
Application: Commonly used in merging diffusion models or foundational LLMs to create balanced blends of capabilities.
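
As a rough sketch, Slerp over two flattened weight vectors can be written as follows. It assumes both vectors are nonzero and falls back to linear interpolation when they are nearly parallel, where the spherical formula becomes numerically unstable. The function name and the `eps` threshold are illustrative choices.

```python
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between flattened weight vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Angle between the two vectors, from their normalized dot product.
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.sin(theta) < eps:
        # Nearly parallel (or anti-parallel) vectors: fall back to plain Lerp.
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
```

For two orthogonal unit vectors, the midpoint at t = 0.5 still has norm 1, while the Lerp midpoint has norm of about 0.71. That shrinkage is what Slerp is meant to avoid.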

Model Soups & Gradient Souping

Methods for creating a unified model from multiple fine-tuned checkpoints.

Uniform Soup: The simplest form, averaging the weights of multiple models fine-tuned from the same base with different hyperparameters or data orders.
Greedy Soup: Iteratively adds a model to the soup only if it improves validation performance on a target task.
Gradient Souping: An advanced technique that merges models by approximating the task vectors that would result from fine-tuning on a mixture of all source tasks simultaneously. It computes a weighted average of gradients from each task to construct a more coherent merged model.
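
The uniform and greedy variants can be sketched as below. This assumes each checkpoint is a NumPy array of the same shape and that `evaluate` is a caller-supplied function returning a held-out score where higher is better. Both function names and the `evaluate` callback are hypothetical.

```python
import numpy as np

def uniform_soup(checkpoints):
    """Average a list of weight arrays fine-tuned from the same base."""
    return np.mean(np.stack(checkpoints), axis=0)

def greedy_soup(checkpoints, evaluate):
    """Add checkpoints in order of held-out score, keeping each one only
    if the averaged soup does not get worse."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(ranked[0])
    for candidate in ranked[1:]:
        score = evaluate(uniform_soup(soup + [candidate]))
        if score >= best:
            soup.append(candidate)
            best = score
    return uniform_soup(soup)
```

With PEFT, the arrays passed in would be the adapter or delta parameters rather than full model weights, so each evaluation step only re-averages a small tensor set.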

Reg-Merge (Regression-Based Merge)

A data-driven merging approach that frames merging as a regression problem. Instead of purely geometric operations on weights, it uses a small calibration dataset to learn the optimal linear combination of multiple model outputs.

Process: A lightweight regression layer (e.g., linear) is trained to combine the logits or hidden states of several frozen, task-specific models.
Advantage: Directly optimizes for performance on the target mixture of skills, often yielding better results than weight-space arithmetic.
PEFT Context: Highly compatible with merged PEFT modules, where the regression layer learns to weight the contributions of different adapters or LoRA modules.
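
One simple reading of the regression idea is a least-squares fit of mixing coefficients over the outputs of frozen models. The sketch below assumes the per-model outputs on a calibration set have already been computed and stacked into one array. The function names and the choice of ordinary least squares are assumptions made for illustration, not a reference implementation of Reg-Merge.

```python
import numpy as np

def fit_mixing_weights(model_outputs, targets):
    """Least-squares weights for combining per-model outputs on calibration data.

    model_outputs: array of shape (num_models, num_examples, num_outputs)
    targets:       array of shape (num_examples, num_outputs)
    """
    outputs = np.asarray(model_outputs, dtype=float)
    num_models = outputs.shape[0]
    # Each column of the design matrix holds one model's flattened outputs.
    design = outputs.reshape(num_models, -1).T
    flat_targets = np.asarray(targets, dtype=float).ravel()
    weights, *_ = np.linalg.lstsq(design, flat_targets, rcond=None)
    return weights

def combine(model_outputs, weights):
    """Weighted sum of per-model outputs using the fitted coefficients."""
    return np.tensordot(weights, np.asarray(model_outputs, dtype=float), axes=1)
```

Unlike the weight-space methods above, this combines outputs, so every source model (or adapter) still has to run at inference time.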

How Does Model Merging Work?

Model merging in Parameter-Efficient Fine-Tuning (PEFT) is a technique for combining multiple specialized adaptations into a single, more capable model without retraining from scratch.

Model merging is the process of arithmetically combining the delta weights or task vectors from multiple independently fine-tuned models into a unified parameter set. Each task vector represents the learned change from a base pre-trained model to a model adapted for a specific task. By strategically merging these vectors—through simple averaging, weighted summation, or more advanced linear arithmetic—a single model can acquire multi-task capabilities or improved generalization, all while preserving the efficiency gains of PEFT methods like LoRA or adapters.

The technique relies on the linear mode connectivity hypothesis, which posits that fine-tuned models often reside in linearly connected low-error basins within the loss landscape. This allows their weight spaces to be combined. Common merging algorithms include Task Arithmetic, which adds weighted task vectors to the base model, and Fisher Merging, which weights contributions by parameter importance. The result is a consolidated model that performs well across the source tasks, enabling efficient multi-task inference from a single checkpoint.
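
Fisher Merging can be illustrated as a per-parameter weighted average. The sketch assumes a diagonal Fisher estimate is already available for each model (for example, averaged squared gradients), given as an array with the same shape as the weights. The function name and the `eps` stabilizer are illustrative.

```python
import numpy as np

def fisher_merge(weights, fishers, eps=1e-12):
    """Per-parameter average of model weights, weighted by diagonal Fisher estimates."""
    weights = np.stack([np.asarray(w, dtype=float) for w in weights])
    fishers = np.stack([np.asarray(f, dtype=float) for f in fishers])
    # Parameters a model considers important (high Fisher) pull the average toward it.
    return (fishers * weights).sum(axis=0) / (fishers.sum(axis=0) + eps)
```

When all Fisher values are equal, this reduces to plain weight averaging. The importance weighting only changes the result where the models disagree about which parameters matter.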

Primary Use Cases & Applications

Model merging leverages the compact delta weights from PEFT to combine multiple specialized models into a single, more capable system. This enables efficient multi-task learning, improved generalization, and the creation of foundational multi-purpose models.

Multi-Task Model Creation

The primary application of model merging is to create a single model capable of performing multiple tasks without catastrophic forgetting or a proportional increase in parameters. By arithmetically combining the task vectors (ΔW) from several models fine-tuned with PEFT methods like LoRA or Adapters, a unified model inherits capabilities across domains.

Example: Merging a legal contract analysis adapter, a financial sentiment adapter, and a general QA adapter into one BERT backbone.
Benefit: Eliminates the need to maintain and serve multiple separate model instances, reducing deployment complexity and memory footprint.

Improving Generalization & Robustness

Merging models fine-tuned on related but distinct datasets can lead to a more robust and generalizable model. This technique, sometimes called model soup or weight averaging, tends to place the merged weights in a flatter region of the loss landscape. The combined model can outperform its individual constituents on held-out validation sets and often shows better out-of-distribution performance.

Mechanism: Averaging task vectors from domains A, B, and C creates a model whose parameters reside in a flatter, more general region of the optimization space.
Use Case: Merging medical imaging adapters trained on X-rays, MRIs, and CT scans to create a more versatile diagnostic assistant.

Efficient Continual Learning

Model merging provides a scalable strategy for continual learning. Instead of retraining on an ever-expanding combined dataset, new tasks are learned in isolation via PEFT, producing a compact task vector. These vectors can then be selectively merged or composed to update the central model.

Process: 1) Freeze base model. 2) Learn Task 1 with LoRA, store ΔW₁. 3) Learn Task 2 with LoRA, store ΔW₂. 4) Merge ΔW₁ and ΔW₂ (e.g., via averaging) with base weights.
Advantage: Mitigates catastrophic forgetting by preserving the base model and applying only additive, non-destructive updates.
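
The four-step process can be sketched for a single weight matrix. This example assumes the usual LoRA parameterization, in which the dense update is the product of two low-rank factors, and uses random values in place of trained adapters. All shapes, names, and the averaging choice are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 4, 2
w_base = rng.normal(size=(d_out, d_in))   # step 1: frozen base weight

def lora_delta(b, a, scaling=1.0):
    """Dense update implied by LoRA factors: delta_W = scaling * B @ A."""
    return scaling * (b @ a)

# Steps 2 and 3: hypothetical adapters learned separately for task 1 and task 2.
b1, a1 = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))
b2, a2 = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))
delta_1, delta_2 = lora_delta(b1, a1), lora_delta(b2, a2)

# Step 4: merge the stored deltas (here by averaging) onto the frozen base.
w_merged = w_base + (delta_1 + delta_2) / 2

# Averaging the factors instead gives a different matrix in general.
factor_avg = lora_delta((b1 + b2) / 2, (a1 + a2) / 2)
```

The last line is a caution: averaging the B and A factors separately is not the same as averaging the deltas they represent, because the product introduces cross terms between the two adapters.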

Creating Foundational Multi-Modal Models

For multimodal architectures like CLIP or BLIP, merging is key to building a unified model proficient across diverse vision-language tasks. Independent VL-Adapters for VQA, image captioning, and visual grounding can be merged, endowing a single model with broad multimodal reasoning.

Architecture: The frozen multimodal fusion backbone remains stable while adapter deltas from each task are combined.
Result: A single model endpoint that can handle retrieval, generation, and classification across visual and textual inputs, optimizing inference infrastructure.

Personalization & Specialization at Scale

Enterprises can maintain a central base model while generating hundreds of personalized or domain-specialized variants via PEFT. For deployment, user-specific or department-specific delta weights can be dynamically loaded or merged on-demand, enabling mass customization without model proliferation.

Workflow: A financial services firm fine-tunes a base LLM with LoRA for equities analysis, another for risk assessment, and another for compliance. These are stored as small task vectors (<1% of model size).
Deployment: The appropriate task vector is merged with the base model at runtime or service startup based on the client or query context.
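
The "<1% of model size" figure can be sanity-checked with simple arithmetic. The sketch assumes LoRA factors of shape (d_out, rank) and (rank, d_in) attached to a dense d_out x d_in layer. The layer size and rank below are hypothetical values chosen for illustration.

```python
def lora_fraction(d_out, d_in, rank):
    """Fraction of a dense d_out x d_in layer's parameters stored by rank-r LoRA factors."""
    return rank * (d_out + d_in) / (d_out * d_in)

# Hypothetical layer size and rank, chosen only for illustration.
fraction = lora_fraction(d_out=4096, d_in=4096, rank=8)
print(f"{fraction:.2%}")
```

For this configuration the adapter stores about 0.39% of the layer's parameters. The overall ratio for a full model depends on which layers carry adapters and on the rank used.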

Research: Model Arithmetic & Editing

Beyond simple averaging, merging enables research into model arithmetic—algebraic manipulation of task vectors to engineer model behavior. For example, Base + (Finance Δ) + (Formal_Tone Δ) - (Informal_Tone Δ) creates a financially expert model with formal communication.

Concept: Task vectors are treated as representations of abstract properties (e.g., 'creativity', 'factuality', 'domain knowledge').
Potential: Enables precise, post-hoc steering of model attributes and the removal of undesired behaviors learned during fine-tuning by subtracting corresponding vectors.

COMPARISON

Model Merging vs. Alternative Multi-Task Approaches

This table compares the core characteristics of the Model Merging paradigm against other established methods for building multi-task capable models.

Feature / Metric	Model Merging (PEFT)	Multi-Task Learning (MTL)	Mixture-of-Experts (MoE)	Single Multi-Task Model
Core Paradigm	Arithmetic combination of task-specific delta weights	Joint training on multiple tasks with a shared backbone	Sparse activation of specialized expert sub-networks	Full fine-tuning on a blended multi-task dataset
Parameter Efficiency
Preserves Base Model Knowledge
Training Compute Overhead	Low (independent fine-tuning)	High (joint optimization)	Very High (expert routing + training)	High (full fine-tuning)
Task Addition / Removal	Modular; additive or subtractive	Requires retraining or complex continual learning	Requires expert addition/retraining	Requires full retraining from base
Inference Cost	Same as base model	Same as base model	~2-4x base model (active params)	Same as base model
Risk of Task Interference	Moderate (conflicting task vectors; mitigated by TIES-Merging or DARE)	High (gradient competition)	Low (experts are specialized)	High (single set of weights)
Typical Use Case	Combining 3-10 specialized adapters (e.g., code, math, chat)	Training a model on closely related tasks (e.g., NER, POS, Chunking)	Extremely large-scale models with 1000s of tasks/capabilities	Domain-specific model for 2-3 tightly coupled tasks

Frequently Asked Questions

Model merging is a core technique in Parameter-Efficient Fine-Tuning (PEFT) that enables the creation of multi-capability models by combining specialized adaptations. This FAQ addresses key technical questions about its mechanisms, applications, and implementation.

Model merging in PEFT is the process of arithmetically combining the delta weights or task vectors from multiple independently fine-tuned models into a single unified model. It works by first fine-tuning a shared frozen backbone model on different tasks using a PEFT method like LoRA or adapters, which produces a small set of task-specific parameters for each task. The core operation is a weighted summation: Merged_Weights = Base_Weights + α * Task_Vector_A + β * Task_Vector_B, where α and β are scaling coefficients. This creates a model that can perform multiple tasks while avoiding the catastrophic forgetting typical of sequential fine-tuning, because the base model's weights are never overwritten during training.
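
Applied to a whole model, the weighted summation runs once per named tensor. The sketch below assumes the base weights and each task vector are dictionaries mapping layer names to NumPy float arrays, and that a task vector may omit layers it did not change. The function name `merge_state` and the layer names are hypothetical.

```python
import numpy as np

def merge_state(base, task_vectors, coefficients):
    """Return base + sum_i coefficient_i * task_vector_i for every named tensor."""
    merged = {name: value.copy() for name, value in base.items()}
    for vector, coeff in zip(task_vectors, coefficients):
        for name, delta in vector.items():   # layers a vector omits stay at base
            merged[name] += coeff * delta
    return merged

base = {"q_proj": np.ones((2, 2)), "v_proj": np.zeros(2)}
task_a = {"q_proj": np.full((2, 2), 2.0)}
task_b = {"q_proj": np.ones((2, 2)), "v_proj": np.array([1.0, -1.0])}
merged = merge_state(base, [task_a, task_b], coefficients=[0.5, 0.25])
```

A negative coefficient subtracts a task vector, so the same helper covers negation-style edits as well as additive merging.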

Related Terms

Model merging combines the learned adaptations from multiple fine-tuned models. These related concepts define the components, operations, and frameworks that make this process possible and efficient.

Delta Weights

Delta weights are the small set of learned parameter changes (ΔW) applied to a frozen pre-trained model during parameter-efficient fine-tuning. They represent the task-specific adaptation.

In PEFT, only the delta weights are trained and stored, not the entire model.
These are the fundamental unit for model merging, as merging operations are performed on the deltas from different tasks.
Storing only deltas is highly storage-efficient, enabling the maintenance of many task-specific adaptations from a single base model.

Task Vectors

A task vector is the arithmetic difference between the weights of a fine-tuned model and its original pre-trained base model. It mathematically encapsulates the knowledge acquired for a specific task.

Calculated as: Task Vector = Fine-tuned Weights - Base Weights.
Enables operations like model merging (adding task vectors) and task negation (subtracting task vectors).
In PEFT, the task vector is often sparse or low-rank, corresponding directly to the trained adapter or LoRA matrices.

AdapterFusion

AdapterFusion is a two-stage PEFT method designed for multi-task learning. It first trains multiple independent, task-specific adapters, then learns a composition layer that dynamically combines them.

Stage 1: Train lightweight adapters for N different tasks on a frozen backbone.
Stage 2: Freeze the adapters and train a new fusion layer that learns to query and combine them for a new target task.
Unlike weight-space merging, AdapterFusion keeps the adapters separate and combines their outputs with a learned attention mechanism, allowing the composite model to leverage knowledge from multiple source tasks without catastrophic interference.

Model Soup

Model soup is a merging technique where the weights of multiple models fine-tuned from the same pre-trained checkpoint are averaged to create a single, more robust model.

Performs a simple arithmetic mean of the weight parameters from different fine-tuned checkpoints.
Often improves generalization and out-of-distribution robustness compared to any single constituent model.
With PEFT, creating a soup is highly efficient because you average only the small delta weights or adapter parameters, not the entire massive backbone.

Task Arithmetic

Task arithmetic is a framework for editing models by adding and subtracting task vectors. It treats model adaptation as operations in weight space.

Addition (Base + Vector_A + Vector_B) merges capabilities from multiple tasks.
Negation (Base + Vector_A - Vector_B) can remove an undesired skill or bias.
Scaling (Base + λ * Vector_A) adjusts the strength of a task's influence.
This provides a principled, mathematical basis for model merging and editing using the compact representations learned via PEFT methods.

Multi-Task PEFT

Multi-Task PEFT refers to strategies for adapting a single pre-trained model to perform well on multiple downstream tasks simultaneously, using parameter-efficient methods.

Approaches include training a shared set of adapters on a mixed multi-task dataset, or using composition techniques like AdapterFusion.
Contrasts with post-hoc model merging, which combines models after each has been trained on a single task.
The goal is to achieve high performance across all tasks while adding only a minimal number of parameters per task, avoiding the need to store many separate models.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
