FedYogi is a federated optimization algorithm within the FedOpt framework that adapts the Yogi adaptive optimizer for server-side model aggregation. It modifies the standard Federated Averaging (FedAvg) update by applying a per-parameter adaptive learning rate to the aggregated client updates, which is particularly effective for training on non-convex objectives common in deep learning. This approach provides more stable convergence than other adaptive methods like FedAdam, especially when client gradients are noisy or statistically heterogeneous (non-IID).
FedYogi

What is FedYogi?
FedYogi is a federated adaptive optimization algorithm designed for stable and efficient server-side aggregation in decentralized machine learning.
The algorithm's distinguishing feature is the Yogi update rule, which changes the second-moment estimate (the denominator of the adaptive learning rate) by an amount that depends only on the current squared gradient, rather than through Adam's exponential moving average. This stops the denominator from shrinking quickly when gradients become small, so the effective learning rate cannot jump abruptly and optimization trajectories are smoother. FedYogi is therefore often a good choice in federated learning scenarios where client data distributions vary significantly and local updates introduce high variance.
Key Features of FedYogi
FedYogi is a federated adaptive optimization algorithm that adapts the Yogi optimizer for server-side aggregation, offering more stable convergence than FedAdam, particularly in the presence of noisy client gradients.
Adaptive Server-Side Aggregation
FedYogi's core innovation is applying an adaptive optimizer directly on the server during the model aggregation step. Instead of performing a simple weighted average of client updates (like FedAvg), the server treats the aggregated client gradient as a pseudo-gradient and applies the Yogi update rule. This adapts the global model's learning rate per parameter based on past update magnitudes, leading to more stable and efficient convergence, especially for non-convex loss landscapes common in deep learning.
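The aggregation step that produces the pseudo-gradient can be sketched as follows. This is a minimal illustration, assuming each client returns its full locally trained weights and that clients are weighted by their number of examples; the function name `aggregate_deltas` and the flat list representation of a model are hypothetical choices for this example, not part of any framework.

```python
from typing import List, Sequence


def aggregate_deltas(
    global_weights: Sequence[float],
    client_weights: Sequence[Sequence[float]],
    num_examples: Sequence[int],
) -> List[float]:
    """Example-weighted average of client deltas (client weights minus global weights)."""
    total = sum(num_examples)
    delta = [0.0] * len(global_weights)
    for weights, n in zip(client_weights, num_examples):
        for i, (w_client, w_global) in enumerate(zip(weights, global_weights)):
            delta[i] += (n / total) * (w_client - w_global)
    return delta


# Two hypothetical clients holding 30 and 10 examples.
global_w = [1.0, -2.0]
clients = [[1.5, -2.0], [0.5, -1.0]]
sizes = [30, 10]
delta = aggregate_deltas(global_w, clients, sizes)
print(delta)  # the server optimizer consumes this vector instead of adding it directly
```

FedAvg would add this vector to the global weights as is, whereas an adaptive server optimizer such as Yogi rescales it per parameter first.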
Robustness to Noisy Gradients
A primary advantage over FedAdam is FedYogi's inherent robustness to stochastic noise in client updates. In federated learning, client gradients can be highly variable due to:
- Non-IID Data: Statistically heterogeneous data across devices.
- Partial Client Participation: Only a subset of devices participates each round.
- Local SGD Variance: Multiple local training steps amplify client-specific noise.
Yogi's update rule changes the adaptive denominator by at most a small multiple of the current squared update, so a few unusually small or large pseudo-gradients cannot swing the effective learning rate abruptly. This helps the server keep making steady progress when gradient estimates are noisy.
The Yogi Update Rule
The server update for a model parameter (\theta) at round (t) is defined by the Yogi optimizer. Let (g_t) be the aggregated client gradient (pseudo-gradient), (m_t) the first moment (biased estimate), and (v_t) the second moment (adaptive term). The key difference from Adam is in the (v_t) update:
[ v_t = v_{t-1} - (1 - \beta_2) \cdot \text{sign}(v_{t-1} - g_t^2) \cdot g_t^2 ]
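Unlike Adam's moving average, the size of this change depends only on (g_t^2), not on (v_{t-1}). When (g_t^2 > v_{t-1}), (v_t) increases by ((1 - \beta_2) g_t^2); when (g_t^2 < v_{t-1}), it decreases by the same small amount, so (v_t) cannot collapse toward zero within a few rounds when (g_t^2) is small relative to (v_{t-1}). This leads to a more conservative and stable adjustment of the per-parameter learning rate (\eta / (\sqrt{v_t} + \epsilon)), making it less sensitive to outlier gradient estimates.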
Comparison to FedAdam
FedYogi is a direct alternative within the FedOpt framework. The critical distinction lies in the second moment estimator ((v_t)):
- FedAdam: Uses the Adam update, where (v_t) is an exponentially moving average (EMA): (v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2). This can cause (v_t) to decrease rapidly if gradients become small, potentially leading to instability.
- FedYogi: Uses the Yogi update, which is an additive, sign-controlled update rather than a moving average. Because (v_t) can shrink by at most ((1-\beta_2) g_t^2) per round, the effective learning rate cannot explode after a run of small gradients, offering more predictable convergence, particularly in later training stages or with high client heterogeneity.
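The difference between the two estimators is easiest to see on a single scalar. The sketch below assumes (\beta_2 = 0.99), a second moment that starts at 1.0, and a pseudo-gradient that suddenly drops to 0.1 for 100 rounds; the helper names `adam_v` and `yogi_v` are made up for this illustration.

```python
def adam_v(v_prev: float, g: float, beta2: float = 0.99) -> float:
    """Adam-style second moment: exponential moving average of g^2."""
    return beta2 * v_prev + (1.0 - beta2) * g * g


def yogi_v(v_prev: float, g: float, beta2: float = 0.99) -> float:
    """Yogi-style second moment: move by (1 - beta2) * g^2 in the direction of g^2."""
    g2 = g * g
    sign = (v_prev > g2) - (v_prev < g2)  # sign(v_prev - g2) as -1, 0, or 1
    return v_prev - (1.0 - beta2) * sign * g2


v_adam = v_yogi = 1.0
for _ in range(100):  # 100 rounds of small pseudo-gradients
    v_adam = adam_v(v_adam, 0.1)
    v_yogi = yogi_v(v_yogi, 0.1)

print(f"Adam v: {v_adam:.4f}, Yogi v: {v_yogi:.4f}")
```

Under these assumptions the Adam-style estimate falls to roughly 0.37 while the Yogi-style estimate only falls to roughly 0.99, so the Yogi effective learning rate barely changes over the same 100 rounds.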
Hyperparameter Tuning & Stability
While adaptive, FedYogi introduces specific hyperparameters that require tuning for optimal performance:
- \beta_1, \beta_2: Decay rates for the first and second moment estimates (typical values: 0.9, 0.999).
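- \tau: The adaptivity (stabilization) constant added to (\sqrt{v_t}) in the denominator of the server update. Larger values make the update less adaptive and closer to plain momentum. It is not the server learning rate, which is a separate parameter (\eta).
- \epsilon: The name many optimizer implementations use for this same small denominator constant; it also guards against division by zero.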
In practice, FedYogi is often reported to be less sensitive to the exact server learning rate than FedAdam, which can make it somewhat easier to tune in production federated systems.
Practical Applications & Use Cases
FedYogi is particularly well-suited for federated learning scenarios characterized by:
- High Client Heterogeneity: Environments with significant variation in local data distributions (non-IID).
- Unreliable Participation: Where client dropouts and delays make the set of contributing updates, and therefore the aggregated update, vary noticeably from round to round.
- Cross-Device FL with Massive Participation: Involving thousands of mobile or IoT devices with sporadic connectivity.
- Training Deep Neural Networks: Where the loss landscape is complex and non-convex.
It is commonly implemented in federated learning frameworks like TensorFlow Federated (TFF) and Flower as a standard server optimizer option, providing a robust alternative to FedAvg and FedAdam for challenging real-world deployments.
FedYogi vs. FedAdam vs. FedAvg
This table compares the core mechanisms, convergence properties, and practical considerations of three foundational server-side aggregation algorithms in federated learning.
| Feature / Mechanism | FedAvg (Baseline) | FedAdam | FedYogi |
|---|---|---|---|
| Core Server Update Rule | Weighted average of client deltas: w ← w + η_global * Δ | Applies Adam to aggregated client deltas: w ← w + η_global * (Adam(Δ)) | Applies Yogi to aggregated client deltas: w ← w + η_global * (Yogi(Δ)) |
| Adaptive Learning Rate | No | Yes | Yes |
| Momentum (First Moment) | None | Exponential moving average (β₁) | Exponential moving average (β₁) |
| Adaptive Second Moment | None | Exponential moving average (β₂). v ← β₂·v + (1-β₂)·Δ² | Sign-based additive update. v ← v - (1-β₂)·Δ²·sign(v - Δ²) |
| Primary Design Goal | Communication efficiency via local SGD | Faster convergence on non-convex problems via adaptive server updates | Stable convergence under noisy or heterogeneous client gradients |
| Key Hyperparameters | Global learning rate (η_global), client fraction, local epochs | η_global, β₁, β₂, ε (for numerical stability) | η_global, β₁, β₂, ε (sometimes written τ), initial value of v |
| Robustness to Client Noise / Heterogeneity | Low. Prone to client drift. | Moderate. Can be sensitive to aggressive variance adaptation. | High. Yogi's second moment cannot shrink rapidly, so the effective step size stays controlled. |
| Typical Convergence Speed (vs. FedAvg) | Baseline | Faster | Comparable or faster, with greater stability |
| Communication Cost per Round | Identical (transmits model deltas/weights) | Identical (transmits model deltas/weights) | Identical (transmits model deltas/weights) |
| Server-Side Compute Overhead | Minimal (simple averaging) | Moderate (maintains and updates moment vectors) | Moderate (maintains and updates moment vectors) |
| Theoretical Guarantees | Well-studied under convex and non-convex assumptions | Convergence under non-convex objectives with adaptive rates | Convergence under non-convex objectives with adaptive rates, similar to FedAdam |
Frameworks and Implementations
Core Algorithm & Server-Side Update
FedYogi is a server-side adaptive optimizer within the FedOpt framework. Instead of a simple weighted average (FedAvg), the server maintains adaptive per-parameter learning rates. The update rule for a global model parameter (\theta_t) is:
[ m_t = \beta_1 m_{t-1} + (1-\beta_1) \Delta_t ] (First moment)
[ v_t = v_{t-1} - (1-\beta_2) \Delta_t^2 \cdot \text{sign}(v_{t-1} - \Delta_t^2) ] (Yogi second moment)
[ \theta_{t+1} = \theta_t + \eta \cdot m_t / (\sqrt{v_t} + \epsilon) ] (Parameter update)
Where (\Delta_t) is the aggregated client update. The key innovation is the Yogi second moment update, which prevents (v_t) from shrinking rapidly and so keeps the effective learning rate from growing abruptly.
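The three update equations can be written as one server step. This is a minimal sketch rather than a framework implementation: the model is a flat list of floats, the names `ServerState`, `init_state`, and `yogi_server_step` are hypothetical, and the default values (including the small positive starting value `v0` for the second moment) are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ServerState:
    weights: List[float]  # global model parameters (theta)
    m: List[float]        # first moment, one entry per parameter
    v: List[float]        # second moment, one entry per parameter


def init_state(weights: Sequence[float], v0: float = 1e-6) -> ServerState:
    # Assumption: the second moment starts at a small positive constant v0.
    n = len(weights)
    return ServerState(list(weights), [0.0] * n, [v0] * n)


def yogi_server_step(
    state: ServerState,
    delta: Sequence[float],
    eta: float = 0.01,
    beta1: float = 0.9,
    beta2: float = 0.99,
    eps: float = 1e-3,
) -> None:
    """Apply one in-place Yogi-style server update from the aggregated delta."""
    for i, d in enumerate(delta):
        d2 = d * d
        state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * d
        sign = (state.v[i] > d2) - (state.v[i] < d2)  # sign(v - d2) as -1, 0, or 1
        state.v[i] -= (1.0 - beta2) * sign * d2
        state.weights[i] += eta * state.m[i] / (math.sqrt(state.v[i]) + eps)


state = init_state([0.0, 0.0])
yogi_server_step(state, delta=[0.5, -0.5])
print(state.weights)
```

Note that the second moment stays non-negative under this rule as long as it starts non-negative, so the square root is always defined.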
The Yogi Moment Update for Stability
The defining feature is its adaptation of the Yogi optimizer's second moment estimation. Unlike FedAdam's Adam-style update ((v_t = \beta_2 v_{t-1} + (1-\beta_2) \Delta_t^2)), Yogi uses:
(v_t = v_{t-1} - (1-\beta_2) \Delta_t^2 \cdot \text{sign}(v_{t-1} - \Delta_t^2))
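- When updates are small ((\Delta_t^2 < v_{t-1})): The update is subtractive, but (v_t) decreases only by ((1-\beta_2) \Delta_t^2), far less than Adam's decrease of ((1-\beta_2)(v_{t-1} - \Delta_t^2)). This prevents the second moment from collapsing, which would cause the effective learning rate (\eta / \sqrt{v_t}) to grow too quickly.
- When updates are large/noisy ((\Delta_t^2 > v_{t-1})): The update is additive, and (v_t) increases by ((1-\beta_2) \Delta_t^2), much as in Adam, so the effective learning rate is reduced promptly. Together, the two cases lead to more stable convergence and robustness to noisy or heterogeneous client updates.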
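The two cases can be checked by writing out the per-round change in (v_t) under each rule, derived directly from the update equations in this section. The numerical values used afterwards ((\beta_2 = 0.99) and the specific magnitudes) are assumptions for illustration only.

```latex
\begin{aligned}
\text{Adam:}\quad v_t - v_{t-1} &= -(1-\beta_2)\,\bigl(v_{t-1} - \Delta_t^2\bigr) \\
\text{Yogi:}\quad v_t - v_{t-1} &= -(1-\beta_2)\,\text{sign}\bigl(v_{t-1} - \Delta_t^2\bigr)\,\Delta_t^2
\end{aligned}
```

With (\beta_2 = 0.99), (v_{t-1} = 1) and (\Delta_t^2 = 0.01), Adam lowers (v_t) by 0.0099 while Yogi lowers it by only 0.0001. With the roles reversed ((v_{t-1} = 0.01), (\Delta_t^2 = 1)), Adam raises (v_t) by 0.0099 and Yogi by 0.01, so the two rules behave almost identically when updates are large and differ mainly when updates become small.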
Comparison with FedAdam and FedAdagrad
As part of the adaptive federated optimization family, FedYogi is designed to outperform its siblings under specific conditions:
- vs. FedAdam: FedAdam's moving average can let (v_t) drop quickly once client updates become small, so the effective learning rate rises sharply and a later large or noisy update can destabilize training. FedYogi's moment update is more conservative, often leading to better final accuracy and training stability.
- vs. FedAdagrad: FedAdagrad's learning rates are monotonically non-increasing, which can be too aggressive, causing learning to stop prematurely. FedYogi provides a more flexible adaptation.
- Use Case: FedYogi is particularly recommended when client data is highly heterogeneous (non-IID) or when client sampling introduces significant variance in the aggregated update (\Delta_t).
Hyperparameters and Tuning
Effective use requires tuning key hyperparameters:
- Server Learning Rate ((\eta)): Often smaller than in FedAvg; values in the range of (0.001) to (0.01) are a common starting point, but the best value depends on the task.
- Momentum ((\beta_1)): Controls the first moment decay, standard value is (0.9).
- Adaptivity ((\beta_2)): Controls the second moment decay, crucial for stability. Values like (0.99) or (0.999) are common.
- Epsilon ((\epsilon)): A small constant (e.g., (10^{-3})) for numerical stability.
- Client-Side Parameters: The number of local epochs and client learning rate remain critical, as they control client drift. FedYogi's server-side adaptivity can partially compensate for aggressive local training.
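The hyperparameters listed above can be collected into one validated configuration object. This is only a sketch: the class name `FedYogiConfig`, its field names, and the defaults (taken from the typical values mentioned in this section, plus an assumed client learning rate and epoch count) are hypothetical and do not correspond to any particular framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FedYogiConfig:
    server_lr: float = 0.01   # eta
    beta1: float = 0.9        # first-moment decay
    beta2: float = 0.99       # second-moment decay
    epsilon: float = 1e-3     # denominator constant
    local_epochs: int = 1     # assumed client-side setting
    client_lr: float = 0.05   # assumed client-side setting

    def __post_init__(self) -> None:
        if self.server_lr <= 0 or self.client_lr <= 0:
            raise ValueError("learning rates must be positive")
        for name in ("beta1", "beta2"):
            value = getattr(self, name)
            if not 0.0 <= value < 1.0:
                raise ValueError(f"{name} must be in [0, 1), got {value}")
        if self.epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.local_epochs < 1:
            raise ValueError("local_epochs must be at least 1")


config = FedYogiConfig(beta2=0.999)
print(config)
```

Validating the ranges up front makes a hyperparameter sweep fail fast on invalid combinations instead of producing a diverging run.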
Practical Considerations and Limitations
When to use FedYogi:
- In cross-device FL with statistically heterogeneous data.
- When client updates are expected to be noisy or high-variance.
- For complex, non-convex models like deep neural networks.
Limitations and Trade-offs:
- Increased Server Memory: The server must store two auxiliary state tensors (first and second moments), each the size of the model, so server-side state is roughly three times the model size instead of one model copy as in FedAvg.
- Hyperparameter Sensitivity: Performance gains are dependent on proper tuning of (\beta_2) and (\eta).
- Communication Cost: The algorithm does not reduce communication overhead; it only changes the server's aggregation method. It is often paired with gradient compression techniques like quantization or sparsification for efficiency.
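The server memory trade-off is simple arithmetic. The sketch below counts the model itself plus the moment tensors, assuming 4-byte (float32) values and a hypothetical model of 25 million parameters; the function name `server_state_bytes` is made up for this example.

```python
def server_state_bytes(
    num_params: int, bytes_per_value: int = 4, num_moments: int = 2
) -> int:
    """Bytes held on the server: one model copy plus `num_moments` same-sized tensors."""
    return num_params * bytes_per_value * (1 + num_moments)


params = 25_000_000  # hypothetical model size
fedavg_bytes = server_state_bytes(params, num_moments=0)
fedyogi_bytes = server_state_bytes(params)  # model + m + v

print(f"FedAvg:  {fedavg_bytes / 1e6:.0f} MB")
print(f"FedYogi: {fedyogi_bytes / 1e6:.0f} MB")
```

For server hardware this overhead is usually modest, but it scales linearly with model size and applies to FedAdam in the same way.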
Frequently Asked Questions
FedYogi is a federated optimization algorithm that adapts the Yogi adaptive optimizer for the server-side aggregation step in federated learning. It operates within the FedOpt framework, where instead of performing a simple weighted average of client updates (as in Federated Averaging (FedAvg)), the server applies an adaptive optimizer to the aggregated client updates. FedYogi applies Yogi's update rule to this aggregated pseudo-gradient, which helps it handle the variance and noise inherent in federated client updates. The server maintains adaptive per-parameter learning rates based on estimates of the first moment (mean) and second moment (uncentered variance) of the aggregated updates. Its key mechanism is a more conservative update to the second moment estimate, which prevents the effective learning rate from increasing abruptly and provides more stable convergence, especially when client gradients are noisy or sparse.
Related Terms
FedYogi is part of a broader family of algorithms designed to solve the unique optimization challenges of federated learning. These related concepts address client heterogeneity, communication efficiency, and adaptive server-side updates.
Adaptive Optimization
Adaptive optimization refers to gradient-based methods that automatically adjust the learning rate for each model parameter. Key characteristics include:
- Per-parameter learning rates based on historical gradient information.
- Momentum to accelerate convergence in relevant directions.
- Examples: Adam, Adagrad, RMSprop, and Yogi. In federated learning, these methods are adapted for server-side aggregation (FedOpt) to handle the non-stationary and heterogeneous update stream from clients.
Client Drift
Client drift is a fundamental challenge in federated learning where local models diverge from the global objective. This occurs because clients perform multiple Local SGD steps on their statistically heterogeneous (non-IID) data. The resulting local updates become biased estimates of the true global gradient. Algorithms like FedYogi, FedProx, and SCAFFOLD are designed to mitigate client drift, with FedYogi addressing it through more stable, adaptive server-side aggregation that is less sensitive to the variance in client updates.
Server-Side Learning Rate
In the FedOpt framework, the server-side learning rate (η) is a critical hyperparameter distinct from the clients' local learning rates. It controls the magnitude of the update applied to the global model after aggregating client contributions. For FedYogi, this parameter interacts with the adaptive moment estimates. A well-tuned server-side learning rate is essential for convergence; too high a value can cause instability, while too low slows progress. FedYogi's design aims to make convergence less sensitive to the exact tuning of this parameter compared to FedAdam.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.