Federated Mirror Descent is a foundational optimization algorithm for federated learning that generalizes standard gradient descent. It performs updates by mapping gradients from a primal model space to a dual space via a mirror map, defined by a strictly convex Bregman divergence like KL-divergence. This allows FMD to naturally handle constraints and non-Euclidean geometries, such as the probability simplex, which are common in applications like personalized recommendation or multi-class classification.
Glossary
Federated Mirror Descent

What is Federated Mirror Descent?
Federated Mirror Descent (FMD) is a generalization of federated optimization that performs gradient updates in a dual space defined by a Bregman divergence, offering a unified framework for handling constraints and non-Euclidean geometry.
The algorithm operates in rounds: clients compute gradients on local data and send these updates to a server. The server performs a mirror descent step in the dual space, then maps the result back to the primal model space for aggregation and redistribution. This framework unifies many federated algorithms, as Federated Averaging (FedAvg) is a special case using squared Euclidean distance. FMD provides stronger theoretical convergence guarantees, especially for non-convex problems and heterogeneous (non-IID) client data distributions.
Key Features of Federated Mirror Descent
Federated Mirror Descent (FMD) generalizes federated optimization by performing updates in a dual space defined by a Bregman divergence. This provides a unified framework for handling constraints, non-Euclidean geometry, and heterogeneous client objectives.
Bregman Divergence as Geometry
The core mechanism of FMD is the use of a Bregman divergence (e.g., KL divergence, squared Euclidean distance) to define the geometry of the optimization space. Instead of a standard gradient step, the update is a mirror map from the dual (gradient) space back to the primal (model) space. This allows the algorithm to naturally adapt to the geometry of the problem, such as the simplex for probability distributions, which is crucial for tasks like personalized recommendation or classification with constraints.
Unified Framework for Constraints
A primary advantage of FMD is its inherent ability to handle constraints. By choosing a Bregman divergence whose domain matches the constraint set (e.g., the negative entropy for the probability simplex), the mirror map automatically projects updates to stay within the feasible region. This eliminates the need for separate, potentially expensive projection steps after each federated round, making constrained federated optimization—common in applications with privacy budgets or resource limits—more efficient and stable.
Generalization of FedAvg and Adaptive Methods
FMD is not a single algorithm but a meta-algorithm. Specific choices recover well-known methods:
- With squared Euclidean distance, FMD reduces to standard Federated Averaging (FedAvg).
- With adaptive divergences or dual-space learning rates, it generalizes FedOpt frameworks like FedAdam or FedYogi. This demonstrates FMD's role as a theoretical and practical umbrella, showing how different federated optimizers relate through the lens of Bregman geometry.
Mitigation of Client Drift
FMD can be designed to directly combat client drift. By using a Bregman divergence that acts as a strong regularizer, the local objective on each client penalizes deviation from a global reference point more effectively than a simple proximal term (as in FedProx). This regularization is geometrically aware, leading to more consistent convergence across clients with highly non-IID data and reducing the number of communication rounds required to reach a target accuracy.
Dual Averaging for Sparse Communication
A common instantiation is Federated Dual Averaging. Clients compute gradients and send them (or their dual averages) to the server. The server performs a single mirror step on the aggregated dual vector. This structure is particularly amenable to gradient compression techniques like quantization or sparsification, as the compression is applied in the dual space. The mirror map's properties can help mitigate the distortion caused by compression, maintaining convergence guarantees where simple averaging might fail.
Theoretical Convergence Guarantees
FMD enjoys strong theoretical foundations. Under standard assumptions (convexity, smoothness), convergence rates can be derived that explicitly depend on the Bregman divergence's strong convexity parameter and client heterogeneity. These rates often match or generalize those of standard federated SGD, providing formal assurance of performance. The analysis typically involves tracking the growth of the Bregman divergence between the global model and an optimal solution, a more general measure than Euclidean distance.
Federated Mirror Descent vs. Other Federated Optimizers
A technical comparison of Federated Mirror Descent (FMD) against prominent federated optimization algorithms, highlighting their core mechanisms, geometric handling, and suitability for different data and system constraints.
| Algorithmic Feature / Property | Federated Mirror Descent (FMD) | Federated Averaging (FedAvg) | FedProx | SCAFFOLD |
|---|---|---|---|---|
Core Optimization Framework | Generalized gradient descent using Bregman divergences | Simple weighted averaging of client SGD updates | FedAvg with a proximal term for client regularization | Uses control variates to correct client drift |
Underlying Geometry | Non-Euclidean (handles simplex, L1-ball, etc.) | Euclidean (L2 geometry) | Euclidean (L2 geometry) | Euclidean (L2 geometry) |
Primary Design Goal | Unified framework for constraints & non-Euclidean geometry | Communication efficiency & simplicity | Stability under systems & statistical heterogeneity | Convergence speed under data heterogeneity (non-IID) |
Handles Constraints Natively | ||||
Client-Side Computation | Mirror descent step (gradient in dual space) | Multiple local SGD steps | Multiple local proximal SGD steps | Multiple local SGD steps with control variate adjustment |
Server-Side Aggregation | Mirror averaging or dual averaging in the primal space | Weighted averaging of parameters | Weighted averaging of parameters | Averaging of parameters and control variates |
Theoretical Guarantees Under Heterogeneity | Convergence for convex/non-convex in general geometry | Convergence issues under severe non-IID data | Improved convergence with bounded client drift | Linear speedup and convergence under non-IID data |
Communication Cost per Round | Same as FedAvg (transmit primal/dual parameters) | Same as FMD (transmit model parameters) | Same as FedAvg (transmit model parameters) | Higher (transmits model parameters + control variates) |
Key Hyperparameter(s) | Mirror map (e.g., negative entropy), learning rate | Local epochs, client fraction, learning rate | Proximal term weight (mu), local epochs | Local/global learning rates, control variate update frequency |
Typical Use Case | Learning on probability simplices (e.g., recommendation), sparse models | Baseline for image/vision models, IID-like data | Cross-device FL with varying client availability & slow networks | Cross-silo FL with stable clients and highly non-IID data |
Frequently Asked Questions
Federated Mirror Descent (FMD) is a foundational optimization framework for federated learning that generalizes gradient-based updates to handle complex geometries and constraints. This FAQ addresses its core mechanisms, advantages, and practical applications.
Federated Mirror Descent (FMD) is a generalization of federated gradient descent that performs optimization updates in a dual space defined by a Bregman divergence, enabling efficient learning on non-Euclidean geometries and constrained parameter sets. It works by iterating a two-step process: first, each client computes a gradient in the primal model space; second, this gradient is used to perform a mirror map update in a dual space, which is then mapped back to the primal space via the gradient of a strictly convex potential function (like the negative entropy for the probability simplex). The server aggregates these dual updates, ensuring the global model respects the underlying geometry of the problem, such as staying within a simplex for probability distributions or on a manifold.
Enabling Efficiency, Speed & Accuracy
Intelligent Analysis, Decision & Execution
We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.
Talk to Us
Search across company data
Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.
Useful when people spend too long searching or get different answers from different systems.

Automate internal workflows
Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.
Useful when repetitive work moves across multiple tools and teams.

Add AI to products and internal tools
Build assistants, guided actions, or decision support into the software your team or customers already use.
Useful when AI needs to be part of the product, not a separate tool.
Related Terms
Federated Mirror Descent is a generalization within the federated optimization landscape. These related terms define the specific algorithms, challenges, and techniques that operate within or alongside its framework.
Federated Averaging (FedAvg)
The foundational algorithm upon which most federated optimization is built. In each round, a server:
- Selects a subset of clients
- Sends the current global model to each
- Clients perform Local SGD on their data
- Clients send model updates back
- The server computes a weighted average of these updates to form a new global model. FedAvg assumes simple averaging in the model's parameter space, a special case of Federated Mirror Descent using Euclidean distance.
FedOpt Framework
A generalization of the server-side aggregation step in federated learning. Instead of a simple weighted average (as in FedAvg), FedOpt applies adaptive optimizer updates (like Adam, Yogi, or Adagrad) to the aggregated client gradients. FedAdam, FedYogi, and FedAdagrad are specific instantiations. Federated Mirror Descent provides the theoretical underpinning for these adaptive updates by defining the proper dual space for the aggregation.
Client Drift
A central challenge that Federated Mirror Descent helps mitigate. Client drift occurs when clients perform multiple local update steps on statistically heterogeneous (non-IID) data, causing their local models to diverge from the global optimum. This is exacerbated by standard FedAvg. Algorithms like SCAFFOLD (which uses control variates) and FedProx (which adds a proximal term) are direct solutions to client drift, and their updates can often be interpreted through the lens of Mirror Descent with a specific Bregman divergence.
Bregman Divergence
The mathematical core defining the "mirror" in Mirror Descent. A Bregman divergence (D_\psi(x, y)) measures the difference between points (x) and (y) using a convex function (\psi).
- For (\psi = \frac{1}{2}|\cdot|^2), the divergence is squared Euclidean distance, recovering standard gradient descent.
- For (\psi) as the negative entropy, the divergence is the Kullback–Leibler (KL) divergence, enabling updates on the probability simplex (e.g., for training softmax layers under constraints). Federated Mirror Descent's flexibility comes from choosing an appropriate (\psi) for the problem geometry.
FedProx
A federated optimization algorithm designed for statistical and systems heterogeneity. FedProx modifies the local client objective function by adding a proximal term (\frac{\mu}{2} |\theta - \theta^t|^2), which penalizes the local model (\theta) for straying too far from the global model (\theta^t). This explicitly counteracts client drift. FedProx can be viewed as an instance of Federated Mirror Descent where the Bregman divergence is the squared Euclidean distance, linking it directly to the broader theoretical framework.
Adaptive Federated Optimization
A class of algorithms that incorporate adaptive, per-parameter learning rates into the federated process. This includes server-side adaptivity (FedOpt algorithms) and client-side adaptivity. Federated Mirror Descent provides a unified view: adaptive methods like Adam implicitly use a Bregman divergence derived from the history of squared gradients. Understanding this connection allows for the design of new adaptive methods with better convergence guarantees in the non-Euclidean geometries common in federated learning (e.g., constrained optimization on devices).

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
Partnered with leading AI, data, and software stack.
How We Work
Custom AI workflows for your Business
One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.
01
Review the use case
We understand the task, the users, and where AI can actually help.
Read more02
Pick the right approach
We define what needs search, automation, or product integration.
Read more03
Build the first useful version
We implement the part that proves the value first.
Read more04
Improve from there
We add the checks and visibility needed to keep it useful.
Read moreThe first call is a practical review of your use case and the right next step.
Talk to Us