Federated Averaging (FedAvg) is the canonical iterative algorithm for decentralized machine learning, enabling a global model to be trained across distributed edge devices without centralizing raw data. Each selected client performs Local Stochastic Gradient Descent (Local SGD) on its private dataset for several epochs. The server then aggregates the resulting model updates via a weighted average, typically by the number of local training examples, to form a new global model for the next round.
Federated Averaging (FedAvg)

What is Federated Averaging (FedAvg)?
Federated Averaging (FedAvg) is the foundational iterative optimization algorithm for federated learning, where a central server aggregates locally computed model updates from a subset of clients by taking a weighted average to produce a new global model.
The algorithm's core innovation is communication efficiency: clients run many local training steps between synchronizations and transmit only model parameters or updates, never raw data. Allowing a variable amount of local computation lets it tolerate systems heterogeneity, and it often works in practice on non-IID data, although it does not explicitly correct for statistical heterogeneity. FedAvg establishes the pattern that later techniques build on, such as FedProx for stability and FedOpt for adaptive server-side optimization, and forms the basis for privacy-preserving, scalable distributed AI.
Key Characteristics of FedAvg
Federated Averaging (FedAvg) is the canonical optimization algorithm for federated learning. Its design is defined by several core mechanisms that enable decentralized training across heterogeneous devices while maintaining data privacy.
Iterative Averaging of Local Updates
FedAvg operates in synchronized communication rounds. In each round, a subset of clients receives the current global model, performs Local Stochastic Gradient Descent (Local SGD) for multiple epochs on their private data, and sends the resulting model update (or the full model) back to the server. The server then computes a weighted average of these updates, typically weighted by the number of training samples on each client, to produce a new global model. This iterative averaging approximates the gradient descent that would occur on a centralized dataset.
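The server-side averaging step can be sketched in a few lines. This is a minimal illustration, assuming each client model is a flat list of floats; the function name `fedavg_aggregate` and the toy values are hypothetical and not taken from any particular library.

```python
from typing import List, Sequence


def fedavg_aggregate(client_weights: Sequence[Sequence[float]],
                     num_examples: Sequence[int]) -> List[float]:
    """Return sum_k (n_k / n) * w_k over the participating clients."""
    if not client_weights or len(client_weights) != len(num_examples):
        raise ValueError("need exactly one example count per client model")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for w_k, n_k in zip(client_weights, num_examples):
        share = n_k / total  # this client's fraction of the cohort's data
        for i in range(dim):
            aggregated[i] += share * w_k[i]
    return aggregated


# Two clients: the second holds three times as much data, so it counts 3x.
new_global = fedavg_aggregate([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(new_global)  # [2.5, 5.0]
```

Weighting by example count means a client with more data pulls the global model further toward its local result.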
Handling of Statistical Heterogeneity (Non-IID Data)
A fundamental challenge for FedAvg is non-IID data, meaning client datasets that are not independent and identically distributed. Real-world device data is inherently heterogeneous (e.g., different user typing habits, local photo libraries). FedAvg often trains usable models on such data in practice, but the multiple local update steps that make it communication-efficient also let each local model adapt to its own distribution before aggregation. This causes client drift, where local models diverge from the global objective and the averaged model can converge slowly or settle on a biased solution. Advanced variants like FedProx and SCAFFOLD introduce mechanisms to explicitly correct for this drift.
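Client drift can be seen on a toy problem. The sketch below assumes two clients with one-dimensional quadratic losses of different curvature, equal weights, and full-batch local gradient descent; all names and constants are illustrative. With one local step per round the averaged model reaches the true global optimum, while with many local steps it settles closer to the plain average of the local optima.

```python
def fedavg_fixed_point(curvatures, optima, local_steps, lr=0.1, rounds=2000):
    """Equal-weight FedAvg where client k minimizes 0.5 * a_k * (w - c_k)**2."""
    w = 0.0
    for _ in range(rounds):
        local_models = []
        for a_k, c_k in zip(curvatures, optima):
            w_k = w  # every client starts the round from the global model
            for _ in range(local_steps):
                w_k -= lr * a_k * (w_k - c_k)  # exact local gradient step
            local_models.append(w_k)
        w = sum(local_models) / len(local_models)
    return w


curvatures = [1.0, 4.0]   # heterogeneous clients: different loss curvature
optima = [0.0, 1.0]       # and different local minimizers

# Minimizer of the average loss: sum(a_k * c_k) / sum(a_k) = 0.8
global_optimum = sum(a * c for a, c in zip(curvatures, optima)) / sum(curvatures)

one_step = fedavg_fixed_point(curvatures, optima, local_steps=1)
many_steps = fedavg_fixed_point(curvatures, optima, local_steps=20)
print(global_optimum, round(one_step, 4), round(many_steps, 4))
```

The gap between `many_steps` and `global_optimum` is the bias introduced by drift in this particular setup; real models behave less cleanly, but the mechanism is the same.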
Partial Client Participation per Round
In practical deployments, it is infeasible and inefficient to involve all clients in every training round due to constraints like device availability, network connectivity, and battery life. FedAvg is designed for partial client participation, where the server samples a fraction of the total client population (e.g., 1-10%) in each round. This sampling is often probabilistic, sometimes weighted by client data volume. This characteristic is crucial for scalability and mirrors the real-world intermittency of edge devices.
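A minimal sketch of per-round cohort selection follows, assuming uniform sampling without replacement from the clients that are currently available. The function name `sample_clients` and its parameters are hypothetical; production systems typically add eligibility checks (charging, idle, on Wi-Fi) and may weight the draw by data volume.

```python
import random


def sample_clients(client_ids, fraction, rng, min_clients=1):
    """Pick a uniform random cohort covering roughly `fraction` of clients."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(client_ids)
    cohort_size = max(min_clients, round(fraction * len(ids)))
    cohort_size = min(cohort_size, len(ids))  # never ask for more than exist
    return rng.sample(ids, cohort_size)       # sampling without replacement


rng = random.Random(42)  # seeded only to make the example reproducible
cohort = sample_clients(range(1000), fraction=0.05, rng=rng)
print(len(cohort))  # 50
```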
Communication Efficiency Priority
The primary bottleneck in federated learning is often communication bandwidth, not computation. FedAvg is explicitly designed for communication efficiency by performing substantial local computation (many SGD steps) between each communication round. This reduces the total number of rounds required for convergence compared to sending gradients after every single batch. Further efficiency is achieved through techniques like gradient compression, quantization, and top-k sparsification, which can be layered on top of the core FedAvg protocol.
Decoupled Server and Client Optimization
FedAvg cleanly separates the optimization processes on the server and clients. The client's role is purely local model training via SGD. The server's role is purely aggregation via a simple weighted average. This decoupling allows for significant flexibility and innovation on both sides. For instance, the FedOpt framework generalizes the server's aggregation step to use adaptive optimizers such as Adam or Yogi (yielding FedAdam and FedYogi) instead of simple averaging. Similarly, clients can employ personalized techniques or different local optimizers.
Privacy by Architecture, Not by Default
FedAvg provides a foundational privacy-by-architecture benefit because raw training data never leaves the client device; only model updates are shared. However, these updates can potentially leak information about the underlying data. Therefore, FedAvg is typically combined with formal privacy-enhancing technologies (PETs) to provide rigorous guarantees. The most common augmentations are:
- Secure Aggregation: Cryptographic protocols that allow the server to compute the sum/average of client updates without inspecting any individual update.
- Differential Privacy: Adding calibrated noise to client updates before they are sent, providing a mathematical guarantee that the output does not reveal whether any individual's data was used in training.
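Both augmentations can be illustrated with toy code. The sketch below is an assumption-laden simplification: `clip_and_add_noise` shows the clip-then-add-Gaussian-noise pattern without computing any formal privacy budget, and `pairwise_mask` shows only the cancellation idea behind secure aggregation, using floating-point masks from a shared random generator. Real protocols use key agreement, finite-field arithmetic, and dropout handling.

```python
import math
import random


def clip_and_add_noise(update, clip_norm, noise_multiplier, rng):
    """Scale the update to L2 norm <= clip_norm, then add Gaussian noise."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clip bound
    return [x * scale + rng.gauss(0.0, sigma) for x in update]


def pairwise_mask(updates, rng, mask_range=1000.0):
    """Add a random mask to client i and subtract it from client j (i < j).

    Each masked update looks random on its own, but the masks cancel in the sum.
    """
    masked = [list(u) for u in updates]
    dim = len(updates[0])
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = [rng.uniform(-mask_range, mask_range) for _ in range(dim)]
            for d in range(dim):
                masked[i][d] += mask[d]
                masked[j][d] -= mask[d]
    return masked


rng = random.Random(0)
clipped = clip_and_add_noise([3.0, 4.0], clip_norm=1.0, noise_multiplier=0.0, rng=rng)

updates = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]
masked = pairwise_mask(updates, rng)
true_sum = [sum(u[d] for u in updates) for d in range(2)]
masked_sum = [sum(m[d] for m in masked) for d in range(2)]
print(clipped, true_sum, masked_sum)
```

With `noise_multiplier=0.0` the first call only clips, which makes the norm bound easy to check; a nonzero multiplier is what provides the privacy protection.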
FedAvg vs. Other Federated Optimization Algorithms
A technical comparison of Federated Averaging (FedAvg) against prominent alternative algorithms, highlighting key design features, convergence properties, and suitability for different federated learning challenges.
| Algorithmic Feature / Metric | FedAvg | FedProx | SCAFFOLD | FedOpt (e.g., FedAdam) |
|---|---|---|---|---|
| Core Innovation | Weighted averaging of client model parameters | Proximal term in local objective to limit client drift | Control variates (variance reduction) to correct client drift | Adaptive server-side optimizer (e.g., Adam, Adagrad) |
| Primary Goal | Foundation: Simple, communication-efficient aggregation | Stability with system & statistical heterogeneity | Fast convergence under data heterogeneity (non-IID) | Improved convergence on non-convex problems |
| Handles Non-IID Data | | | | |
| Mitigates Client Drift | No | Yes (proximal term) | Yes (control variates) | Partial (via adaptive server updates) |
| Server Update Rule | Static weighted average: w = Σ (n_k / n) * w_k | Static weighted average of proximal-constrained updates | Static average with control variate correction: w = w - η * Σ Δ_k | Adaptive update: w = w - η_server * Optimizer(Σ Δ_k) |
| Client-Side Computation Overhead | Baseline (Local SGD) | Low (proximal term calculation) | Low (maintains control variate state) | Baseline (Local SGD) |
| Communication Cost per Round | Baseline (full model parameters) | Baseline (full model parameters) | ~2x Baseline (model + control variates) | Baseline (full model parameters) |
| Convergence Speed (Typical vs. FedAvg on non-IID) | Baseline | Similar or slightly faster | Significantly faster | Faster, especially on complex models |
| Theoretical Guarantees | Under convex & IID assumptions | Convergence with bounded heterogeneity | Strong convergence rates for non-IID data | Convergence with adaptive server methods |
Common Use Cases for Federated Averaging
Federated Averaging (FedAvg) is deployed in domains where data privacy is paramount, computational resources are distributed, and regulatory compliance restricts data centralization. These use cases highlight its practical implementation.
Frequently Asked Questions
Federated Averaging (FedAvg) is the foundational algorithm for decentralized machine learning. These questions address its core mechanics, practical challenges, and relationship to other optimization techniques.
Federated Averaging (FedAvg) is the canonical iterative optimization algorithm for federated learning, where a central server coordinates the training of a shared global model across a massive population of decentralized clients (e.g., mobile phones, IoT devices) without ever accessing their raw local data.
It works through repeated communication rounds:
- Server Broadcast: The central server selects a subset of available clients and sends the current global model parameters to them.
- Local Training: Each selected client performs multiple epochs of Local Stochastic Gradient Descent (Local SGD) on its own private dataset, starting from the global model.
- Update Transmission: Clients send their locally updated model parameters (or gradients) back to the server.
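- Aggregation: The server computes a weighted average of the received client models to produce a new global model. The weight for each client is typically proportional to its local dataset size. This aggregation step is the core 'averaging' operation; it can optionally run under a cryptographic secure aggregation protocol, but plain FedAvg does not require one.
The process repeats until the global model converges. This architecture provides a baseline privacy benefit rather than a formal guarantee: sensitive training data never leaves the client device, although the shared model updates can still leak information unless they are further protected.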
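The four steps above can be sketched end to end on a toy problem. This is an illustrative simulation under stated assumptions, not a production implementation: the model is a two-parameter linear fit `y = w * x + b`, clients are simulated in-process, each client sees a different range of `x` values (a simple form of non-IID data), and all function names and hyperparameters are hypothetical.

```python
import random


def local_sgd(weights, data, epochs, lr, batch_size, rng):
    """Mini-batch SGD on squared error for the model y = w * x + b."""
    w, b = weights
    data = list(data)  # copy so shuffling does not reorder the client's dataset
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            errors = [(w * x + b - y, x) for x, y in batch]
            grad_w = sum(2 * e * x for e, x in errors) / len(batch)
            grad_b = sum(2 * e for e, _ in errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return [w, b]


def fedavg(client_data, rounds, clients_per_round, epochs, lr, batch_size, seed=0):
    rng = random.Random(seed)
    global_w = [0.0, 0.0]
    for _ in range(rounds):
        # 1. Server broadcast: pick a cohort and hand it the global model.
        cohort = rng.sample(range(len(client_data)), clients_per_round)
        # 2-3. Local training and update transmission.
        local_models = [local_sgd(global_w, client_data[k], epochs, lr, batch_size, rng)
                        for k in cohort]
        # 4. Aggregation: average weighted by local dataset size.
        sizes = [len(client_data[k]) for k in cohort]
        total = sum(sizes)
        global_w = [sum(n * m[i] for n, m in zip(sizes, local_models)) / total
                    for i in range(2)]
    return global_w


def make_clients(seed=1):
    """Five clients with different x ranges and sizes, all following y = 3x + 1."""
    rng = random.Random(seed)
    clients = []
    for k, center in enumerate([-1.0, -0.5, 0.0, 0.5, 1.0]):
        xs = [rng.uniform(center - 1.0, center + 1.0) for _ in range(30 + 10 * k)]
        clients.append([(x, 3.0 * x + 1.0) for x in xs])
    return clients


slope, intercept = fedavg(make_clients(), rounds=200, clients_per_round=3,
                          epochs=2, lr=0.05, batch_size=10)
print(round(slope, 2), round(intercept, 2))
```

Because every simulated client shares the same underlying relationship, the global model should recover a slope near 3 and an intercept near 1; with genuinely conflicting client objectives the result would show the drift discussed elsewhere in this entry.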
Related Terms
Federated Averaging (FedAvg) is the foundational algorithm, but modern federated learning relies on a rich ecosystem of specialized techniques to handle data heterogeneity, system constraints, and privacy requirements.
FedProx
FedProx is a federated optimization algorithm designed to handle statistical and systems heterogeneity. It modifies the local client's objective function by adding a proximal term that penalizes the distance between the local model and the current global model. This constraint limits client drift, stabilizes training, and improves convergence when clients have varying computational capabilities or non-IID data distributions. It is a direct, robust enhancement to the standard FedAvg procedure.
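As a rough sketch, the proximal term adds `mu * (w - w_global)` to the local gradient. The example below assumes a one-dimensional quadratic client loss with curvature 1 and a local optimum at 10; the function name, the coefficient name `mu`, and the constants are illustrative choices.

```python
def fedprox_local_step(w, w_global, grad, lr, mu):
    """One gradient step on f_k(w) + (mu / 2) * ||w - w_global||^2."""
    return [wi - lr * (gi + mu * (wi - gwi))
            for wi, gwi, gi in zip(w, w_global, grad)]


def run_local_training(mu, steps=200, lr=0.1, local_optimum=10.0):
    """Train from the global model w = 0 on the loss 0.5 * (w - local_optimum)**2."""
    w_global = [0.0]
    w = list(w_global)
    for _ in range(steps):
        grad = [w[0] - local_optimum]  # gradient of the client's own loss
        w = fedprox_local_step(w, w_global, grad, lr, mu)
    return w[0]


plain = run_local_training(mu=0.0)  # mu = 0 recovers ordinary local gradient descent
prox = run_local_training(mu=1.0)   # the proximal term pulls back toward w_global
print(round(plain, 4), round(prox, 4))  # 10.0 5.0
```

With `mu = 1` the local model stops halfway between the global model and the client's own optimum, which is the sense in which the proximal term limits how far a client can drift in one round.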
SCAFFOLD
SCAFFOLD (Stochastic Controlled Averaging) tackles the fundamental problem of client drift caused by data heterogeneity. It introduces control variates—correction terms stored on both the server and each client. These variates estimate the difference between the client's local gradient and the global gradient direction. By applying this correction during local training, SCAFFOLD achieves significantly faster convergence and better final accuracy than FedAvg on highly non-IID data, albeit with increased communication cost for the control states.
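The control-variate correction can be sketched on a toy problem. The code below assumes two clients with one-dimensional quadratic losses, full participation, exact gradients, and the variant of the control-variate refresh that reuses the local progress `(x - y) / (K * lr)`; names and constants are illustrative, and a real implementation operates on full parameter tensors with sampled clients.

```python
def scaffold(curvatures, optima, local_steps, lr=0.1, server_lr=1.0, rounds=200):
    """SCAFFOLD-style training where client k minimizes 0.5 * a_k * (w - c_k)**2."""
    n = len(curvatures)
    x = 0.0                # global model
    c_global = 0.0         # server control variate
    c_local = [0.0] * n    # one control variate per client
    for _ in range(rounds):
        delta_models, delta_controls = [], []
        for k in range(n):
            y = x
            for _ in range(local_steps):
                grad = curvatures[k] * (y - optima[k])
                # Drift correction: remove the local bias, add the global direction.
                y -= lr * (grad - c_local[k] + c_global)
            new_c = c_local[k] - c_global + (x - y) / (local_steps * lr)
            delta_models.append(y - x)
            delta_controls.append(new_c - c_local[k])
            c_local[k] = new_c
        x += server_lr * sum(delta_models) / n
        c_global += sum(delta_controls) / n  # full participation: plain mean
    return x


curvatures = [1.0, 4.0]
optima = [0.0, 1.0]
# Minimizer of the average loss: sum(a_k * c_k) / sum(a_k) = 0.8
global_optimum = sum(a * c for a, c in zip(curvatures, optima)) / sum(curvatures)
result = scaffold(curvatures, optima, local_steps=20)
print(global_optimum, round(result, 6))
```

Even with 20 local steps per round, the corrected updates should land on the global optimum here, at the cost of sending a control-variate delta alongside each model delta.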
FedOpt Framework
The FedOpt framework generalizes the server-side aggregation step of FedAvg. Instead of adopting the weighted average of client models directly, FedOpt treats the averaged client update (the difference between the client models and the current global model) as a pseudo-gradient and feeds it to an adaptive optimizer (like Adam, Yogi, or Adagrad) on the server. This allows the global model update to incorporate momentum, per-parameter learning rates, and other advanced optimization dynamics. FedAdam, FedYogi, and FedAdagrad are specific instantiations of this framework, often leading to improved performance on complex, non-convex models.
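A FedAdam-style server step might look like the sketch below. It assumes flat parameter lists, averages client deltas weighted by example count (an illustrative choice), and omits bias correction; the class name `FedAdamServer` and the default hyperparameters are hypothetical.

```python
import math


class FedAdamServer:
    """Server optimizer that applies Adam-style moments to the averaged delta."""

    def __init__(self, dim, server_lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
        self.server_lr, self.beta1, self.beta2, self.tau = server_lr, beta1, beta2, tau
        self.m = [0.0] * dim  # first moment (momentum)
        self.v = [0.0] * dim  # second moment (per-parameter scale)

    def step(self, global_w, client_weights, num_examples):
        total = sum(num_examples)
        dim = len(global_w)
        # Averaged update, pointing from the global model toward the clients.
        delta = [sum(n * (w_k[i] - global_w[i])
                     for w_k, n in zip(client_weights, num_examples)) / total
                 for i in range(dim)]
        self.m = [self.beta1 * m + (1 - self.beta1) * d for m, d in zip(self.m, delta)]
        self.v = [self.beta2 * v + (1 - self.beta2) * d * d for v, d in zip(self.v, delta)]
        return [w + self.server_lr * m / (math.sqrt(v) + self.tau)
                for w, m, v in zip(global_w, self.m, self.v)]


server = FedAdamServer(dim=1)
new_w = server.step([0.0], [[1.0]], [1])
print(new_w)
```

Setting the server learning rate to 1 and replacing the moment updates with the raw delta would reduce this step to plain FedAvg.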
Local SGD (Client-Side Training)
Local Stochastic Gradient Descent (Local SGD) is the core training procedure executed by each client in a federated learning round. Instead of performing a single gradient step, each selected client runs multiple epochs of SGD on its local dataset. This communication-computation trade-off is central to FedAvg's efficiency: it reduces the frequency of communication (saving bandwidth) at the cost of potential client drift. The number of local steps is a critical hyperparameter balancing convergence speed and final model quality.
Gradient Compression
Gradient compression is a suite of techniques to reduce the communication bottleneck in federated learning. Instead of sending full-precision model updates, clients compress their gradients before transmission. Key methods include:
- Quantization: Mapping 32-bit floats to lower-bit representations (e.g., 8-bit).
- Sparsification: Transmitting only the most significant gradient values (e.g., Top-k Sparsification).
- Error Feedback: A crucial companion technique that accumulates compression error locally and adds it to the next gradient, preserving convergence guarantees despite the lossy compression.
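The three methods above can be sketched together. The code assumes flat gradient lists; `quantize_8bit` uses a simple symmetric uniform scheme and `ErrorFeedbackCompressor` pairs top-k sparsification with a locally stored residual. All names are illustrative rather than drawn from a specific library.

```python
def quantize_8bit(vector):
    """Map floats to integer codes in [-127, 127] plus one shared scale."""
    scale = max(abs(v) for v in vector) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vector], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def top_k_sparsify(vector, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    keep = set(sorted(range(len(vector)), key=lambda i: abs(vector[i]),
                      reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


class ErrorFeedbackCompressor:
    """Top-k compression that carries the dropped mass into the next round."""

    def __init__(self, dim, k):
        self.k = k
        self.residual = [0.0] * dim

    def compress(self, gradient):
        corrected = [g + r for g, r in zip(gradient, self.residual)]
        sparse = top_k_sparsify(corrected, self.k)
        self.residual = [c - s for c, s in zip(corrected, sparse)]
        return sparse


compressor = ErrorFeedbackCompressor(dim=4, k=1)
grad_1 = [0.5, -2.0, 0.3, 1.0]
grad_2 = [0.1, 0.2, 0.1, 1.5]
sent_1 = compressor.compress(grad_1)  # only the -2.0 entry is transmitted
sent_2 = compressor.compress(grad_2)  # the leftover 1.0 joins 1.5 and is sent

codes, scale = quantize_8bit(grad_1)
restored = dequantize(codes, scale)
print(sent_1, sent_2, codes)
```

The invariant worth noting is that everything transmitted so far plus the current residual equals the sum of all raw gradients, so no information is permanently discarded, only delayed.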
Asynchronous Federated Learning
Asynchronous Federated Optimization departs from the synchronized round structure of FedAvg. In this paradigm, the central server updates the global model immediately upon receiving an update from any client, without waiting for a fixed cohort. Algorithms like FedAsync handle stale updates from slow clients by applying a mixing hyperparameter that decays with the update's age. This approach improves system efficiency in highly heterogeneous environments where device availability and connectivity vary dramatically, at the cost of more complex convergence analysis.
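A staleness-aware mixing step in the spirit of FedAsync can be sketched as follows, assuming a polynomial decay of the mixing weight with update age. The function names, the `decay` exponent, and the example values are illustrative assumptions.

```python
def staleness_weight(alpha, staleness, decay=0.5):
    """Shrink the base mixing weight alpha as the update gets older."""
    return alpha * (staleness + 1) ** (-decay)


def fedasync_update(global_w, client_w, alpha, staleness, decay=0.5):
    """Blend one client's model into the global model as soon as it arrives."""
    a = staleness_weight(alpha, staleness, decay)
    return [(1 - a) * g + a * c for g, c in zip(global_w, client_w)]


# staleness = number of global versions produced since the client fetched its copy
fresh = fedasync_update([0.0], [1.0], alpha=0.5, staleness=0)
stale = fedasync_update([0.0], [1.0], alpha=0.5, staleness=3)
print(fresh, stale)  # [0.5] [0.25]
```

An update that is three versions old moves the global model half as far as a fresh one in this configuration, which is how slow clients are kept from dragging the model backwards.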

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.