FedYogi

FedYogi is a federated adaptive optimization algorithm that adapts the Yogi optimizer for server-side aggregation, offering more stable convergence than FedAdam, particularly in the presence of noisy client gradients.

FEDERATED OPTIMIZATION TECHNIQUE

What is FedYogi?

FedYogi is a federated adaptive optimization algorithm designed for stable and efficient server-side aggregation in decentralized machine learning.

FedYogi is a federated optimization algorithm within the FedOpt framework that adapts the Yogi adaptive optimizer for server-side model aggregation. It modifies the standard Federated Averaging (FedAvg) update by applying a per-parameter adaptive learning rate to the aggregated client updates, which is particularly effective for training on non-convex objectives common in deep learning. This approach provides more stable convergence than other adaptive methods like FedAdam, especially when client gradients are noisy or statistically heterogeneous (non-IID).

The algorithm's key innovation is its use of the Yogi update rule, which updates the second-moment estimate (the denominator of the adaptive learning rate) more conservatively than Adam. Adam's exponential moving average can shrink that denominator quickly once gradient magnitudes drop, which sharply inflates the effective learning rate; Yogi instead changes it by an amount proportional to the current squared gradient, leading to smoother optimization trajectories. FedYogi is therefore a common choice in federated learning scenarios where client data distributions vary significantly and local updates introduce high variance, since it aims to keep convergence fast without sacrificing stability.

Key Features of FedYogi

Adaptive Server-Side Aggregation

FedYogi's core innovation is applying an adaptive optimizer directly on the server during the model aggregation step. Instead of performing a simple weighted average of client updates (like FedAvg), the server treats the aggregated client gradient as a pseudo-gradient and applies the Yogi update rule. This adapts the global model's learning rate per parameter based on past update magnitudes, leading to more stable and efficient convergence, especially for non-convex loss landscapes common in deep learning.
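As a rough illustration of the aggregation step that produces the pseudo-gradient, the sketch below computes an example-weighted average of client model deltas in plain Python. The function name `aggregate_client_deltas`, the flat-list representation of a model, and the weighting by example count are assumptions made for illustration and do not come from any particular framework.

```python
from typing import List, Sequence


def aggregate_client_deltas(
    deltas: Sequence[Sequence[float]],
    num_examples: Sequence[int],
) -> List[float]:
    """Return the example-weighted average of per-client model deltas."""
    if len(deltas) != len(num_examples) or not deltas:
        raise ValueError("need one example count per client delta")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("clients must report a positive number of examples")

    dim = len(deltas[0])
    aggregated = [0.0] * dim
    for delta, n in zip(deltas, num_examples):
        if len(delta) != dim:
            raise ValueError("all client deltas must have the same length")
        weight = n / total  # clients with more data count for more
        for i in range(dim):
            aggregated[i] += weight * delta[i]
    return aggregated


# Two hypothetical clients with 10 and 30 local examples.
client_deltas = [[0.2, -0.4], [0.6, 0.0]]
client_sizes = [10, 30]
pseudo_gradient = aggregate_client_deltas(client_deltas, client_sizes)
print(pseudo_gradient)
```

The resulting vector is what a server optimizer such as Yogi would consume in place of a gradient.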

Robustness to Noisy Gradients

A primary advantage over FedAdam is FedYogi's inherent robustness to stochastic noise in client updates. In federated learning, client gradients can be highly variable due to:

Non-IID Data: Statistically heterogeneous data across devices.
Partial Client Participation: Only a subset of devices participates each round.
Local SGD Variance: Multiple local training steps amplify client-specific noise.

Yogi's update rule changes the adaptive denominator by an amount proportional to the current squared update, so a few unusually small or large rounds cannot swing the effective learning rate abruptly. This helps maintain steady progress when gradient estimates are noisy and reduces the risk of unstable steps that derail training.

The Yogi Update Rule

The server update for a model parameter (\theta) at round (t) is defined by the Yogi optimizer. Let (g_t) be the aggregated client gradient (pseudo-gradient), (m_t) the first moment (biased estimate), and (v_t) the second moment (adaptive term). The key difference from Adam is in the (v_t) update:

[ v_t = v_{t-1} - (1 - \beta_2) \cdot \text{sign}(v_{t-1} - g_t^2) \cdot g_t^2 ]

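This sign-based adaptation moves (v_t) toward (g_t^2) by ((1 - \beta_2) g_t^2) per round: (v_t) increases when (g_t^2 > v_{t-1}) and decreases when (g_t^2 < v_{t-1}). Because the size of the change depends only on (g_t^2), (v_t) cannot collapse toward zero within a few rounds when (g_t^2) is small relative to (v_{t-1}). This leads to a more conservative and stable adjustment of the per-parameter learning rate (\eta / (\sqrt{v_t} + \epsilon)), making it less sensitive to outlier gradient estimates.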
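The second-moment rule above can be sketched for a single scalar parameter as follows. The function names and the example values are illustrative assumptions; only the formula itself comes from the text.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def yogi_second_moment(v_prev: float, g: float, beta2: float = 0.999) -> float:
    """One Yogi second-moment step for a single parameter."""
    g2 = g * g
    return v_prev - (1.0 - beta2) * sign(v_prev - g2) * g2


# Small squared gradient (0.01 < 1.0): v shrinks, but only by (1 - beta2) * g^2.
v_small = yogi_second_moment(v_prev=1.0, g=0.1, beta2=0.9)

# Large squared gradient (4.0 > 1.0): v grows by (1 - beta2) * g^2.
v_large = yogi_second_moment(v_prev=1.0, g=2.0, beta2=0.9)

print(v_small, v_large)
```

In both calls the change in the second moment has magnitude `(1 - beta2) * g**2`; only its direction depends on how `g**2` compares with the previous value.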

Comparison to FedAdam

FedYogi is a direct alternative within the FedOpt framework. The critical distinction lies in the second moment estimator ((v_t)):

FedAdam: Uses the Adam update, where (v_t) is an exponential moving average (EMA): (v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2). This can cause (v_t) to decrease rapidly if gradients become small, which inflates the effective learning rate and can lead to instability.
FedYogi: Uses the Yogi update, an additive, sign-controlled step rather than a moving average. The size of each change to (v_t) depends only on (g_t^2), not on (v_{t-1}), which keeps the effective learning rate from exploding and offers more predictable convergence, particularly in later training stages or with high client heterogeneity.
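To make the contrast concrete, the following sketch tracks both second-moment estimates for one parameter after gradients suddenly become small. The starting value, gradient size, round count, and the deliberately low `beta2 = 0.9` are arbitrary choices that make the effect visible quickly; they are not recommended settings.

```python
import math


def adam_second_moment(v_prev: float, g: float, beta2: float) -> float:
    """Adam-style exponential moving average of squared gradients."""
    return beta2 * v_prev + (1.0 - beta2) * g * g


def yogi_second_moment(v_prev: float, g: float, beta2: float) -> float:
    """Yogi-style sign-controlled additive update."""
    g2 = g * g
    diff = v_prev - g2
    direction = (diff > 0) - (diff < 0)
    return v_prev - (1.0 - beta2) * direction * g2


beta2 = 0.9
v_adam = v_yogi = 1.0  # both estimators start from the same value

# Fifty rounds of small gradients after a period of large ones.
for _ in range(50):
    v_adam = adam_second_moment(v_adam, 0.01, beta2)
    v_yogi = yogi_second_moment(v_yogi, 0.01, beta2)

# Multiplier applied to the base learning rate (ignoring epsilon).
lr_scale_adam = 1.0 / math.sqrt(v_adam)
lr_scale_yogi = 1.0 / math.sqrt(v_yogi)
print(f"Adam v={v_adam:.6f}, step multiplier={lr_scale_adam:.2f}")
print(f"Yogi v={v_yogi:.6f}, step multiplier={lr_scale_yogi:.2f}")
```

Under these assumptions the Adam-style estimate falls by more than two orders of magnitude, so its step multiplier grows more than tenfold, while the Yogi-style estimate barely moves.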

Hyperparameter Tuning & Stability

While adaptive, FedYogi introduces specific hyperparameters that require tuning for optimal performance:

\beta_1, \beta_2: Decay rates for the first and second moment estimates (typical values: 0.9, 0.999).
\tau: The adaptivity parameter used in the FedOpt formulation. It is added to (\sqrt{v_t}) in the denominator of the server update, and the second moment is initialized to at least (\tau^2), so it bounds how large a per-parameter step can become. It is separate from the server learning rate (\eta).
\epsilon: A small constant for numerical stability. Many implementations expose a single denominator constant, in which case (\epsilon) and (\tau) refer to the same quantity.

FedYogi is often reported to need less aggressive tuning than FedAdam to train stably, although the server learning rate (\eta) and (\tau) still interact and are best tuned together.

Practical Applications & Use Cases

FedYogi is particularly well-suited for federated learning scenarios characterized by:

High Client Heterogeneity: Environments with significant variation in local data distributions (non-IID).
Unreliable or Noisy Networks: Where client updates may be corrupted or delayed.
Cross-Device FL with Massive Participation: Involving thousands of mobile or IoT devices with sporadic connectivity.
Training Deep Neural Networks: Where the loss landscape is complex and non-convex.

It is commonly implemented in federated learning frameworks like TensorFlow Federated (TFF) and Flower as a standard server optimizer option, providing a robust alternative to FedAvg and FedAdam for challenging real-world deployments.

SERVER-SIDE OPTIMIZER COMPARISON

FedYogi vs. FedAdam vs. FedAvg

This table compares the core mechanisms, convergence properties, and practical considerations of three foundational server-side aggregation algorithms in federated learning.

Feature / Mechanism	FedAvg (Baseline)	FedAdam	FedYogi
Core Server Update Rule	Weighted average of client deltas: w ← w + η_global · Δ	Applies Adam to aggregated client deltas: w ← w + η_global · Adam(Δ)	Applies Yogi to aggregated client deltas: w ← w + η_global · Yogi(Δ)
Adaptive Learning Rate	No	Yes (per parameter)	Yes (per parameter)
Momentum (First Moment)	None	Exponential moving average (β₁)	Exponential moving average (β₁)
Adaptive Second Moment	None	Exponential moving average (β₂): v ← β₂·v + (1-β₂)·Δ²	Sign-controlled additive update: v ← v - (1-β₂)·Δ²·sign(v - Δ²)
Primary Design Goal	Communication efficiency via local SGD	Faster convergence on non-convex problems via adaptive server updates	Stable convergence under noisy or heterogeneous client gradients
Key Hyperparameters	Global learning rate (η_global), client fraction, local epochs	η_global, β₁, β₂, ε (denominator constant, also written τ)	η_global, β₁, β₂, ε (denominator constant, also written τ), initial value of v
Robustness to Client Noise / Heterogeneity	Low. Prone to client drift.	Moderate. Can be sensitive to aggressive variance adaptation.	High. Yogi's second moment update limits how fast v can shrink.
Typical Convergence Speed (vs. FedAvg)	Baseline	Faster	Comparable or faster, with greater stability
Communication Cost per Round	Identical (transmits model deltas/weights)	Identical (transmits model deltas/weights)	Identical (transmits model deltas/weights)
Server-Side Compute Overhead	Minimal (simple averaging)	Moderate (maintains and updates moment vectors)	Moderate (maintains and updates moment vectors)
Theoretical Guarantees	Well-studied under convex and non-convex assumptions	Convergence under non-convex objectives with adaptive rates	Convergence with provable adaptivity, robust to gradient noise

FEDERATED OPTIMIZATION TECHNIQUES

Frameworks and Implementations

Core Algorithm & Server-Side Update

FedYogi is a server-side adaptive optimizer within the FedOpt framework. Instead of a simple weighted average (FedAvg), the server maintains adaptive per-parameter learning rates. The update rule for a global model parameter (\theta_t) is:

(m_t = \beta_1 m_{t-1} + (1-\beta_1) \Delta_t) (First moment)
(v_t = v_{t-1} - (1-\beta_2) \Delta_t^2 \cdot \text{sign}(v_{t-1} - \Delta_t^2)) (Yogi second moment)
(\theta_{t+1} = \theta_t + \eta \cdot m_t / (\sqrt{v_t} + \epsilon)) (Parameter update)

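Where (\Delta_t) is the aggregated client update, and the square, square root, and division are applied element-wise. The key innovation is the Yogi second moment update, which keeps (v_t) from shrinking abruptly and so prevents sudden jumps in the adaptive learning rate.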
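Putting the three equations together, a minimal server-side sketch in plain Python might look like the following. The class name `FedYogiServer`, the flat-list model representation, the default values, and the small positive `initial_v` are assumptions for illustration; production frameworks operate on tensors and may initialize state differently.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class FedYogiServer:
    """Holds the global model and Yogi state across federation rounds."""

    params: List[float]
    eta: float = 0.01        # server learning rate
    beta1: float = 0.9       # first-moment decay
    beta2: float = 0.99      # second-moment decay
    epsilon: float = 1e-3    # denominator constant
    initial_v: float = 1e-6  # assumed starting value for the second moment
    m: List[float] = field(init=False)
    v: List[float] = field(init=False)

    def __post_init__(self) -> None:
        self.params = list(self.params)  # avoid mutating the caller's list
        self.m = [0.0] * len(self.params)
        self.v = [self.initial_v] * len(self.params)

    def step(self, delta: Sequence[float]) -> List[float]:
        """Apply one round's aggregated client update to the global model."""
        if len(delta) != len(self.params):
            raise ValueError("delta must match the model size")
        for i, d in enumerate(delta):
            d2 = d * d
            # First moment: exponential moving average of the update.
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * d
            # Second moment: Yogi's sign-controlled additive update.
            diff = self.v[i] - d2
            direction = (diff > 0) - (diff < 0)
            self.v[i] -= (1.0 - self.beta2) * d2 * direction
            # Parameter update with a per-parameter step size.
            self.params[i] += self.eta * self.m[i] / (math.sqrt(self.v[i]) + self.epsilon)
        return self.params


server = FedYogiServer(params=[0.0, 0.0])
new_params = server.step([0.5, -0.1])
print(new_params, server.v)
```

The moment lists `m` and `v` persist on the server between calls to `step`, which is the state that distinguishes this optimizer from plain averaging.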

The Yogi Moment Update for Stability

The defining feature is its adaptation of the Yogi optimizer's second moment estimation. Unlike FedAdam's Adam-style update ((v_t = \beta_2 v_{t-1} + (1-\beta_2) \Delta_t^2)), Yogi uses:

(v_t = v_{t-1} - (1-\beta_2) \Delta_t^2 \cdot \text{sign}(v_{t-1} - \Delta_t^2))

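When updates are small ((\Delta_t^2 < v_{t-1})): The sign term is positive, so (v_t) decreases by ((1-\beta_2) \Delta_t^2). Adam would instead decrease it by ((1-\beta_2)(v_{t-1} - \Delta_t^2)), which can be far larger. Yogi's slower decrease prevents the effective learning rate (\eta / \sqrt{v_t}) from growing too quickly.
When updates are large or noisy ((\Delta_t^2 > v_{t-1})): The sign term is negative, so (v_t) increases by ((1-\beta_2) \Delta_t^2), much as in Adam, and the effective learning rate shrinks. Reacting to large updates while relaxing only slowly after small ones leads to more stable convergence and robustness to noisy or heterogeneous client updates.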
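A short worked example, using the arbitrary values (\beta_2 = 0.99) and (v_{t-1} = 0.04), shows how the two second-moment rules defined above behave in each case:

```latex
\begin{aligned}
&\text{Setup: } \beta_2 = 0.99,\quad v_{t-1} = 0.04 \\
&\text{Small update } (\Delta_t = 0.1,\ \Delta_t^2 = 0.01 < v_{t-1}): \\
&\quad \text{Yogi: } v_t = 0.04 - 0.01 \cdot 0.01 \cdot (+1) = 0.0399 \\
&\quad \text{Adam: } v_t = 0.99 \cdot 0.04 + 0.01 \cdot 0.01 = 0.0397 \\
&\text{Large update } (\Delta_t = 1,\ \Delta_t^2 = 1 > v_{t-1}): \\
&\quad \text{Yogi: } v_t = 0.04 - 0.01 \cdot 1 \cdot (-1) = 0.05 \\
&\quad \text{Adam: } v_t = 0.99 \cdot 0.04 + 0.01 \cdot 1 = 0.0496
\end{aligned}
```

In the small-update case Yogi lowers (v_t) by 0.0001 while Adam lowers it by 0.0003, and the gap widens as (v_{t-1}) grows relative to (\Delta_t^2). In the large-update case both estimates rise by nearly the same amount.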

Comparison with FedAdam and FedAdagrad

As part of the adaptive federated optimization family, FedYogi is designed to outperform its siblings under specific conditions:

vs. FedAdam: FedAdam's moving average can shrink the second moment quickly after a run of small updates, inflating the effective learning rate so that a later large or noisy update causes an oversized step. FedYogi's moment update is more conservative, often leading to better final accuracy and training stability.
vs. FedAdagrad: FedAdagrad's learning rates are monotonically non-increasing, which can be too aggressive, causing learning to stop prematurely. FedYogi provides a more flexible adaptation.
Use Case: FedYogi is particularly recommended when client data is highly heterogeneous (non-IID) or when client sampling introduces significant variance in the aggregated update (\Delta_t).

Hyperparameters and Tuning

Effective use requires tuning key hyperparameters:

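Server Learning Rate ((\eta)): Often smaller than the server learning rate used with FedAvg (commonly (1.0)), with (0.001) to (0.01) a typical starting range. The best value is task dependent and interacts with (\epsilon).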
Momentum ((\beta_1)): Controls the first moment decay, standard value is (0.9).
Adaptivity ((\beta_2)): Controls the second moment decay, crucial for stability. Values like (0.99) or (0.999) are common.
Epsilon ((\epsilon)): A small constant (e.g., (10^{-3})) for numerical stability.
Client-Side Parameters: The number of local epochs and client learning rate remain critical, as they control client drift. FedYogi's server-side adaptivity can partially compensate for aggressive local training.
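One way to keep these settings together and sweep over them is sketched below. The `FedYogiConfig` class, its validation rules, and the `build_grid` helper are hypothetical names introduced for illustration; the default values mirror the figures quoted in the list above.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class FedYogiConfig:
    """Server-side hyperparameters for one FedYogi run."""

    eta: float = 0.01      # server learning rate
    beta1: float = 0.9     # first-moment decay
    beta2: float = 0.99    # second-moment decay
    epsilon: float = 1e-3  # numerical-stability constant

    def __post_init__(self) -> None:
        if self.eta <= 0.0:
            raise ValueError("eta must be positive")
        for name in ("beta1", "beta2"):
            value = getattr(self, name)
            if not 0.0 <= value < 1.0:
                raise ValueError(f"{name} must lie in [0, 1)")
        if self.epsilon <= 0.0:
            raise ValueError("epsilon must be positive")


def build_grid(
    etas: Sequence[float],
    beta2s: Sequence[float],
    epsilons: Sequence[float],
) -> List[FedYogiConfig]:
    """Enumerate every combination of the supplied candidate values."""
    return [
        FedYogiConfig(eta=eta, beta2=beta2, epsilon=epsilon)
        for eta, beta2, epsilon in product(etas, beta2s, epsilons)
    ]


grid = build_grid(etas=[0.001, 0.01], beta2s=[0.99, 0.999], epsilons=[1e-3])
print(len(grid))
```

Each configuration in `grid` would then be evaluated with a full federated training run, which is the expensive part of the sweep.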

Implementation in Federated Learning Frameworks

FedYogi is implemented in major open-source FL frameworks, providing a production-ready optimizer:

TensorFlow Federated (TFF): The Yogi optimizer is available as tff.learning.optimizers.build_yogi and can be passed as the server optimizer to tff.learning.algorithms.build_weighted_fed_avg.
Flower: Available as the built-in flwr.server.strategy.FedYogi strategy. Custom variants can be written by subclassing a Strategy and overriding the server-side aggregation and update logic.
PyTorch / PySyft: Requires implementing the server aggregation function that applies the Yogi update rule to the averaged client gradients.

These implementations handle the core logic of maintaining the first and second moment states on the server across federation rounds.

Practical Considerations and Limitations

When to use FedYogi:

In cross-device FL with statistically heterogeneous data.
When client updates are expected to be noisy or high-variance.
For complex, non-convex models like deep neural networks.

Limitations and Trade-offs:

Increased Server Memory: The server must store two auxiliary state tensors (the first and second moments), each the size of the model, so the stored state is roughly three times the model size instead of one as in FedAvg.
Hyperparameter Sensitivity: Performance gains are dependent on proper tuning of (\beta_2) and (\eta).
Communication Cost: The algorithm does not reduce communication overhead; it only changes the server's aggregation method. It is often paired with gradient compression techniques like quantization or sparsification for efficiency.
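The memory trade-off can be estimated with simple arithmetic. The sketch below counts model-sized tensors held on the server, assuming FedAvg keeps only the global model while FedAdam and FedYogi also keep two moment tensors; the function name, the 4-byte (float32) value size, and the 100-million-parameter example are illustrative assumptions.

```python
def server_state_bytes(
    num_params: int,
    optimizer: str = "fedyogi",
    bytes_per_value: int = 4,
) -> int:
    """Estimate bytes of persistent server state for a given optimizer."""
    # Model-sized tensors kept between rounds: the global model itself,
    # plus first and second moments for the adaptive optimizers.
    tensors_per_optimizer = {"fedavg": 1, "fedadam": 3, "fedyogi": 3}
    if optimizer not in tensors_per_optimizer:
        raise ValueError(f"unknown optimizer: {optimizer}")
    return tensors_per_optimizer[optimizer] * num_params * bytes_per_value


num_params = 100_000_000  # hypothetical 100M-parameter model
fedavg_bytes = server_state_bytes(num_params, "fedavg")
fedyogi_bytes = server_state_bytes(num_params, "fedyogi")
print(fedavg_bytes / 1e9, "GB vs", fedyogi_bytes / 1e9, "GB")
```

This covers persistent state only; buffers used while aggregating incoming client updates add to both totals.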

FEDYOGI

Frequently Asked Questions

FedYogi is a federated optimization algorithm that adapts the Yogi adaptive optimizer for the server-side aggregation step in federated learning. It operates within the FedOpt framework, where instead of performing a simple weighted average of client updates (as in Federated Averaging (FedAvg)), the server applies an adaptive optimizer to the aggregated client update, treated as a pseudo-gradient. FedYogi applies the Yogi update rule to that pseudo-gradient, which helps it handle the variance and noise inherent in federated client updates. The server maintains adaptive per-parameter learning rates based on estimates of the first moment (mean) and second moment (uncentered variance) of the aggregated updates. Its key mechanism is a more conservative update to the second moment estimate, which prevents abrupt swings in the effective learning rate and provides more stable convergence, especially when client gradients are noisy or sparse.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
