Inferensys

FedYogi

FedYogi is a federated adaptive optimization algorithm that adapts the Yogi optimizer for server-side aggregation, offering more stable convergence than FedAdam, particularly in the presence of noisy client gradients.
FEDERATED OPTIMIZATION TECHNIQUE

What is FedYogi?

FedYogi is a federated adaptive optimization algorithm designed for stable and efficient server-side aggregation in decentralized machine learning.

FedYogi is a federated optimization algorithm within the FedOpt framework that adapts the Yogi adaptive optimizer for server-side model aggregation. It modifies the standard Federated Averaging (FedAvg) update by applying a per-parameter adaptive learning rate to the aggregated client updates, which is particularly effective for training on non-convex objectives common in deep learning. This approach provides more stable convergence than other adaptive methods like FedAdam, especially when client gradients are noisy or statistically heterogeneous (non-IID).

The algorithm's key innovation is its use of the Yogi update rule, which changes the second-moment estimate (the denominator of the adaptive learning rate) additively, by an amount that depends only on the current squared update. Adam's moving average can let this estimate shrink quickly after a run of small updates, which makes the effective learning rate jump; Yogi shrinks it gradually, leading to smoother optimization trajectories. FedYogi is therefore a common choice in federated learning scenarios where client data distributions vary significantly and local updates introduce high variance, since it aims to keep convergence fast without sacrificing stability.

FEDERATED OPTIMIZATION TECHNIQUE

Key Features of FedYogi

01

Adaptive Server-Side Aggregation

FedYogi's core innovation is applying an adaptive optimizer directly on the server during the model aggregation step. Instead of performing a simple weighted average of client updates (as in FedAvg), the server treats the aggregated client update (the average change in client weights) as a pseudo-gradient and applies the Yogi update rule. This adapts the global model's learning rate per parameter based on past update magnitudes, leading to more stable and efficient convergence, especially for non-convex loss landscapes common in deep learning.
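
The aggregation step that produces the pseudo-gradient can be sketched as follows. This is a minimal illustration, assuming flat NumPy weight vectors and weighting by each client's example count; the function name `aggregate_deltas` is hypothetical and not taken from any framework.

```python
import numpy as np

def aggregate_deltas(global_weights, client_weights, num_examples):
    """Weighted average of client deltas (client model minus global model)."""
    deltas = np.stack([w - global_weights for w in client_weights])
    weights = np.asarray(num_examples, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    # The result is the pseudo-gradient handed to the server optimizer.
    return np.tensordot(weights, deltas, axes=1)

global_w = np.array([0.0, 0.0])
clients = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
delta = aggregate_deltas(global_w, clients, num_examples=[10, 30])
# delta is [0.25, 1.5]: the second client holds 75% of the examples
```

FedAvg would add this delta to the global weights directly; a FedOpt-style server passes it to an optimizer instead.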

02

Robustness to Noisy Gradients

A primary advantage over FedAdam is FedYogi's inherent robustness to stochastic noise in client updates. In federated learning, client gradients can be highly variable due to:

  • Non-IID Data: Statistically heterogeneous data across devices.
  • Partial Client Participation: Only a subset of devices participates each round.
  • Local SGD Variance: Multiple local training steps amplify client-specific noise.

Yogi's update rule changes the adaptive denominator by an amount tied to the current squared gradient, so a few small or noisy estimates cannot shrink it quickly. This keeps the effective learning rate from spiking, which helps maintain steady progress and avoid convergence to suboptimal points.

03

The Yogi Update Rule

The server update for a model parameter (\theta) at round (t) is defined by the Yogi optimizer. Let (g_t) be the aggregated client gradient (pseudo-gradient), (m_t) the first moment (biased estimate), and (v_t) the second moment (adaptive term). The key difference from Adam is in the (v_t) update:

[ v_t = v_{t-1} - (1 - \beta_2) \cdot \text{sign}(v_{t-1} - g_t^2) \cdot g_t^2 ]

This sign-based rule moves (v_t) toward (g_t^2) in steps of size ((1 - \beta_2) g_t^2): (v_t) grows when (g_t^2 > v_{t-1}) and shrinks when (g_t^2 < v_{t-1}). Because the step size depends only on (g_t^2), (v_t) cannot collapse toward zero when (g_t^2) is small relative to (v_{t-1}). This leads to a more conservative and stable adjustment of the per-parameter learning rate (\eta / (\sqrt{v_t} + \epsilon)), making it less sensitive to outlier gradient estimates.
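
The second-moment rule above can be written in a few lines of NumPy. This is a sketch for illustration only; the function names and the default values for `beta2`, `eta`, and `eps` are assumptions.

```python
import numpy as np

def yogi_second_moment(v_prev, g, beta2=0.99):
    """One Yogi update of the second moment, applied per parameter."""
    g2 = np.square(g)
    # sign(v_prev - g2) is +1 when v_prev is above g2 (v shrinks)
    # and -1 when v_prev is below g2 (v grows).
    return v_prev - (1.0 - beta2) * np.sign(v_prev - g2) * g2

def effective_lr(v, eta=0.01, eps=1e-3):
    """Per-parameter step scale eta / (sqrt(v) + eps)."""
    return eta / (np.sqrt(v) + eps)

v_prev = np.array([1.0, 1e-4])   # one large and one small accumulator
g = np.array([0.1, 1.0])         # small gradient, then large gradient
v_new = yogi_second_moment(v_prev, g)
# v_new is approximately [0.9999, 0.0101]
lr = effective_lr(v_new)
```

The first coordinate shrinks slightly and the second grows, so the accumulator can move in either direction, but never by more than `(1 - beta2) * g**2` per step.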

04

Comparison to FedAdam

FedYogi is a direct alternative within the FedOpt framework. The critical distinction lies in the second moment estimator ((v_t)):

  • FedAdam: Uses the Adam update, where (v_t) is an exponential moving average (EMA): (v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2). This can cause (v_t) to decrease rapidly if gradients become small, which sharply raises the effective learning rate and can destabilize training.
  • FedYogi: Uses the Yogi update, which applies an additive, sign-controlled correction rather than a moving average. Because each change in (v_t) is at most ((1-\beta_2) g_t^2), (v_t) decays slowly when gradients become small. This prevents the effective learning rate from exploding and offers more predictable convergence, particularly in later training stages or with high client heterogeneity.
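
To make the difference concrete, the sketch below runs both second-moment rules through 100 rounds of small squared gradients, starting from the same accumulator value. The starting value, gradient size, and (\beta_2 = 0.99) are assumptions chosen for illustration.

```python
import numpy as np

def adam_v(v, g2, beta2):
    """Adam-style exponential moving average of squared gradients."""
    return beta2 * v + (1.0 - beta2) * g2

def yogi_v(v, g2, beta2):
    """Yogi-style additive, sign-controlled correction."""
    return v - (1.0 - beta2) * np.sign(v - g2) * g2

beta2 = 0.99
g2 = 1e-4                # squared gradients become small
v_adam = v_yogi = 1.0    # both start from the same large accumulator

for _ in range(100):
    v_adam = adam_v(v_adam, g2, beta2)
    v_yogi = yogi_v(v_yogi, g2, beta2)

# v_adam falls to roughly 0.37, while v_yogi stays near 0.9999
print(v_adam, v_yogi)
```

Since the step scale is proportional to (1 / \sqrt{v_t}), the Adam-style accumulator here allows a noticeably larger step after the quiet period than the Yogi-style one does.
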
05

Hyperparameter Tuning & Stability

While adaptive, FedYogi introduces specific hyperparameters that require tuning for optimal performance:

  • \beta_1, \beta_2: Decay rates for the first and second moment estimates (typical values: 0.9, 0.999).
  • \tau: The adaptivity parameter from the FedOpt formulation. It is added to (\sqrt{v_t}) in the update's denominator and lower-bounds the initial second moment ((v_{-1} \ge \tau^2)). It does not scale the pseudo-gradient and is separate from the server learning rate (\eta).
  • \epsilon: A small constant for numerical stability in the denominator. In the FedOpt formulation, (\tau) takes this role.

FedYogi is often reported to be less sensitive to the choice of (\tau) and the server learning rate than FedAdam, which can make it easier to tune in production federated systems, although both values still need to be validated per task.
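
One way to keep these settings together and reject invalid values early is a small configuration object. The class below is a hypothetical sketch, not part of any framework. It assumes the FedOpt convention in which `tau` is the constant added to the denominator and the second moment starts at `tau ** 2`; the default values are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FedYogiConfig:
    """Hypothetical container for server-side FedYogi settings."""
    server_lr: float = 0.01
    beta1: float = 0.9
    beta2: float = 0.999
    tau: float = 1e-3   # assumption: plays the role of epsilon in the denominator

    def __post_init__(self):
        if self.server_lr <= 0:
            raise ValueError("server_lr must be positive")
        for name in ("beta1", "beta2"):
            value = getattr(self, name)
            if not 0.0 <= value < 1.0:
                raise ValueError(f"{name} must be in [0, 1), got {value}")
        if self.tau <= 0:
            raise ValueError("tau must be positive")

    @property
    def initial_v(self) -> float:
        """Starting value for the second moment (assumed to be tau squared)."""
        return self.tau ** 2

cfg = FedYogiConfig()
```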

06

Practical Applications & Use Cases

FedYogi is particularly well-suited for federated learning scenarios characterized by:

  • High Client Heterogeneity: Environments with significant variation in local data distributions (non-IID).
  • Unreliable or Noisy Networks: Where dropped or delayed clients change which updates reach the server in each round, adding variance to the aggregate.
  • Cross-Device FL with Massive Participation: Involving thousands of mobile or IoT devices with sporadic connectivity.
  • Training Deep Neural Networks: Where the loss landscape is complex and non-convex.

It is commonly implemented in federated learning frameworks like TensorFlow Federated (TFF) and Flower as a standard server optimizer option, providing a robust alternative to FedAvg and FedAdam for challenging real-world deployments.

SERVER-SIDE OPTIMIZER COMPARISON

FedYogi vs. FedAdam vs. FedAvg

This table compares the core mechanisms, convergence properties, and practical considerations of three foundational server-side aggregation algorithms in federated learning.

| Feature / Mechanism | FedAvg (Baseline) | FedAdam | FedYogi |
| --- | --- | --- | --- |
| Core Server Update Rule | Weighted average of client deltas: w ← w + η_global * Δ | Applies Adam to aggregated client deltas: w ← w + η_global * (Adam(Δ)) | Applies Yogi to aggregated client deltas: w ← w + η_global * (Yogi(Δ)) |
| Adaptive Learning Rate | No | Yes | Yes |
| Momentum (First Moment) | None | Exponential moving average (β₁) | Exponential moving average (β₁) |
| Adaptive Second Moment | None | Exponential moving average (β₂). v ← β₂·v + (1-β₂)·Δ² | Additive, sign-controlled correction. v ← v - (1-β₂)·Δ²·sign(v - Δ²) |
| Primary Design Goal | Communication efficiency via local SGD | Faster convergence on non-convex problems via adaptive server updates | Stable convergence under noisy or heterogeneous client gradients |
| Key Hyperparameters | Global learning rate (η_global), client fraction, local epochs | η_global, β₁, β₂, ε (for numerical stability) | η_global, β₁, β₂, ε, τ (initial accumulator value) |
| Robustness to Client Noise / Heterogeneity | Low. Prone to client drift. | Moderate. Can be sensitive to aggressive variance adaptation. | High. Yogi's second moment update keeps the estimate from collapsing rapidly. |
| Typical Convergence Speed (vs. FedAvg) | Baseline | Faster | Comparable or faster, with greater stability |
| Communication Cost per Round | Identical (transmits model deltas/weights) | Identical (transmits model deltas/weights) | Identical (transmits model deltas/weights) |
| Server-Side Compute Overhead | Minimal (simple averaging) | Moderate (maintains and updates moment vectors) | Moderate (maintains and updates moment vectors) |
| Theoretical Guarantees | Well-studied under convex and non-convex assumptions | Convergence under non-convex objectives with adaptive rates | Analyzed for non-convex objectives within the same FedOpt framework |

FEDERATED OPTIMIZATION TECHNIQUES

Frameworks and Implementations

01

Core Algorithm & Server-Side Update

FedYogi is a server-side adaptive optimizer within the FedOpt framework. Instead of a simple weighted average (FedAvg), the server maintains adaptive per-parameter learning rates. The update rule for a global model parameter (\theta_t) is:

[ m_t = \beta_1 m_{t-1} + (1-\beta_1) \Delta_t \quad \text{(first moment)} ]

[ v_t = v_{t-1} - (1-\beta_2) \Delta_t^2 \cdot \text{sign}(v_{t-1} - \Delta_t^2) \quad \text{(Yogi second moment)} ]

[ \theta_{t+1} = \theta_t + \eta \cdot m_t / (\sqrt{v_t} + \epsilon) ]

Here (\Delta_t) is the aggregated client update. The key innovation is the Yogi second moment update, which prevents (v_t) from decaying rapidly and so keeps the adaptive learning rate from growing abruptly.
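
Put together, one server round might look like the sketch below. The class name `FedYogiServer`, the default hyperparameters, and the choice to initialize (v) at (\epsilon^2) are assumptions made for illustration; real frameworks expose their own interfaces.

```python
import numpy as np

class FedYogiServer:
    """Minimal sketch of server-side state and one FedYogi round."""

    def __init__(self, theta, eta=0.01, beta1=0.9, beta2=0.99, eps=1e-3):
        self.theta = np.asarray(theta, dtype=float)
        self.m = np.zeros_like(self.theta)          # first moment
        self.v = np.full_like(self.theta, eps**2)   # assumed small positive start
        self.eta, self.beta1, self.beta2, self.eps = eta, beta1, beta2, eps

    def step(self, delta):
        """Apply one aggregated client update (the pseudo-gradient)."""
        delta = np.asarray(delta, dtype=float)
        d2 = np.square(delta)
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * delta
        self.v = self.v - (1.0 - self.beta2) * np.sign(self.v - d2) * d2
        # The sign is '+' because delta already points in the descent direction.
        self.theta = self.theta + self.eta * self.m / (np.sqrt(self.v) + self.eps)
        return self.theta

server = FedYogiServer(theta=[0.0, 0.0])
new_theta = server.step(delta=[1.0, -2.0])
```

Each coordinate moves in the direction of its own delta, and the two steps end up nearly equal in size even though one delta is twice the other, which is the per-parameter normalization at work.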

02

The Yogi Moment Update for Stability

The defining feature is its adaptation of the Yogi optimizer's second moment estimation. Unlike FedAdam's Adam-style update ((v_t = \beta_2 v_{t-1} + (1-\beta_2) \Delta_t^2)), Yogi uses:

(v_t = v_{t-1} - (1-\beta_2) \Delta_t^2 \cdot \text{sign}(v_{t-1} - \Delta_t^2))

  • When updates are small ((\Delta_t^2 < v_{t-1})): (v_t) decreases, but only by ((1-\beta_2) \Delta_t^2). Adam would decrease it by ((1-\beta_2)(v_{t-1} - \Delta_t^2)), which is much larger when (v_{t-1}) is large, so Yogi lets the effective learning rate (\eta / \sqrt{v_t}) rise only gradually.
  • When updates are large or noisy ((\Delta_t^2 > v_{t-1})): (v_t) increases by ((1-\beta_2) \Delta_t^2), comparable to Adam's increase, so the effective learning rate drops promptly. Together, the two cases give more stable convergence and robustness to noisy or heterogeneous client updates.
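
Plugging numbers into the two rules shows where they differ. The values below are assumed for illustration, with (\beta_2 = 0.99).

```latex
\begin{align*}
% Small update: v_{t-1} = 1, \Delta_t^2 = 0.01
\text{Adam:}\quad v_t &= 0.99 \cdot 1 + 0.01 \cdot 0.01 = 0.9901 \\
\text{Yogi:}\quad v_t &= 1 - 0.01 \cdot 0.01 \cdot \text{sign}(1 - 0.01) = 0.9999 \\[6pt]
% Large update: v_{t-1} = 0.01, \Delta_t^2 = 1
\text{Adam:}\quad v_t &= 0.99 \cdot 0.01 + 0.01 \cdot 1 = 0.0199 \\
\text{Yogi:}\quad v_t &= 0.01 - 0.01 \cdot 1 \cdot \text{sign}(0.01 - 1) = 0.02
\end{align*}
```

For the small update, Adam lowers (v_t) by 0.0099 while Yogi lowers it by 0.0001. For the large update, the two rules raise (v_t) by almost the same amount.
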
03

Comparison with FedAdam and FedAdagrad

As part of the adaptive federated optimization family, FedYogi is designed to outperform its siblings under specific conditions:

  • vs. FedAdam: FedAdam's moving average can let the second moment shrink quickly after a run of small updates, so the effective learning rate jumps and a later large or noisy update can destabilize training. FedYogi's moment update is more conservative, often leading to better final accuracy and training stability.
  • vs. FedAdagrad: FedAdagrad's learning rates are monotonically non-increasing, which can be too aggressive, causing learning to stop prematurely. FedYogi provides a more flexible adaptation.
  • Use Case: FedYogi is particularly recommended when client data is highly heterogeneous (non-IID) or when client sampling introduces significant variance in the aggregated update (\Delta_t).
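
The three optimizers differ only in how they update the second moment, which the sketch below places side by side. The function name `second_moment` and the string keys are hypothetical; the FedAdagrad rule is assumed to be the plain running sum of squared updates.

```python
import numpy as np

def second_moment(rule, v_prev, delta, beta2=0.99):
    """Return the next second-moment value under the named rule."""
    d2 = np.square(delta)
    if rule == "adagrad":
        return v_prev + d2                            # only ever grows
    if rule == "adam":
        return beta2 * v_prev + (1.0 - beta2) * d2    # moving average
    if rule == "yogi":
        return v_prev - (1.0 - beta2) * np.sign(v_prev - d2) * d2
    raise ValueError(f"unknown rule: {rule}")

v_prev, delta = 1.0, 0.1
results = {r: float(second_moment(r, v_prev, delta))
           for r in ("adagrad", "adam", "yogi")}
# approximately: adagrad 1.01, adam 0.9901, yogi 0.9999
```

With a small update, the Adagrad-style accumulator still grows, the Adam-style one drops fastest, and the Yogi-style one barely moves.
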
04

Hyperparameters and Tuning

Effective use requires tuning key hyperparameters:

  • Server Learning Rate ((\eta)): Typically needs to be smaller than in FedAvg, often in the range of (0.001) to (0.01).
  • Momentum ((\beta_1)): Controls the first moment decay, standard value is (0.9).
  • Adaptivity ((\beta_2)): Controls the second moment decay, crucial for stability. Values like (0.99) or (0.999) are common.
  • Epsilon ((\epsilon)): A small constant (e.g., (10^{-3})) for numerical stability.
  • Client-Side Parameters: The number of local epochs and client learning rate remain critical, as they control client drift. FedYogi's server-side adaptivity can partially compensate for aggressive local training.
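
To show how the client-side and server-side settings interact, here is a toy end-to-end simulation. Everything in it is an assumption made for illustration: three clients with quadratic losses centered at different points (a stand-in for non-IID data), full participation in every round, and hand-picked hyperparameters.

```python
import numpy as np

def local_sgd(theta, center, client_lr, local_steps):
    """Run a few gradient steps on the toy loss 0.5 * ||w - center||^2."""
    w = theta.copy()
    for _ in range(local_steps):
        w -= client_lr * (w - center)   # gradient of the toy loss
    return w

def yogi_step(theta, m, v, delta, eta, beta1, beta2, eps):
    """One server-side FedYogi update from the aggregated delta."""
    d2 = delta ** 2
    m = beta1 * m + (1 - beta1) * delta
    v = v - (1 - beta2) * np.sign(v - d2) * d2
    theta = theta + eta * m / (np.sqrt(v) + eps)
    return theta, m, v

def run(rounds=200, client_lr=0.1, local_steps=5, eta=0.05,
        beta1=0.9, beta2=0.99, eps=1e-3):
    # Each client's optimum differs, mimicking heterogeneous data.
    centers = np.array([[1.0, 1.0], [3.0, -1.0], [-1.0, 3.0]])
    theta, m, v = np.zeros(2), np.zeros(2), np.full(2, eps ** 2)
    for _ in range(rounds):
        updates = [local_sgd(theta, c, client_lr, local_steps) - theta
                   for c in centers]
        delta = np.mean(updates, axis=0)   # equal client weights
        theta, m, v = yogi_step(theta, m, v, delta, eta, beta1, beta2, eps)
    return theta, centers.mean(axis=0)

theta, target = run()
print(np.linalg.norm(theta - target))   # distance to the global optimum
```

Changing `local_steps` or `client_lr` changes the size of each client delta, and the server's normalization by (\sqrt{v_t}) absorbs part of that change, which is one way to see the partial compensation described above.
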
06

Practical Considerations and Limitations

When to use FedYogi:

  • In cross-device FL with statistically heterogeneous data.
  • When client updates are expected to be noisy or high-variance.
  • For complex, non-convex models like deep neural networks.

Limitations and Trade-offs:

  • Increased Server Memory: The server must store two auxiliary state tensors (the first and second moments), each the size of the model, roughly tripling the server-side state compared to FedAvg.
  • Hyperparameter Sensitivity: Performance gains are dependent on proper tuning of (\beta_2) and (\eta).
  • Communication Cost: The algorithm does not reduce communication overhead; it only changes the server's aggregation method. It is often paired with gradient compression techniques like quantization or sparsification for efficiency.
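
The server memory cost can be estimated with simple arithmetic. The helper below is a hypothetical sketch that counts the model weights plus two moment tensors of the same shape, assuming 4-byte floats and ignoring framework overhead.

```python
def server_state_bytes(num_params, bytes_per_value=4, optimizer="fedyogi"):
    """Rough server-side state size: model weights plus optimizer moments."""
    extra_tensors = {"fedavg": 0, "fedadam": 2, "fedyogi": 2}[optimizer]
    return num_params * bytes_per_value * (1 + extra_tensors)

params = 10_000_000  # assumed model size for illustration
fedavg_mb = server_state_bytes(params, optimizer="fedavg") / 1e6
fedyogi_mb = server_state_bytes(params, optimizer="fedyogi") / 1e6
# fedavg_mb is 40.0 and fedyogi_mb is 120.0
```

For a 10-million-parameter model in 32-bit floats, that is about 40 MB of state for FedAvg and about 120 MB for FedYogi.
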
FEDYOGI

Frequently Asked Questions

FedYogi is a federated optimization algorithm that adapts the Yogi adaptive optimizer for the server-side aggregation step in federated learning. It operates within the FedOpt framework, where instead of performing a simple weighted average of client updates (as in Federated Averaging (FedAvg)), the server applies an adaptive optimizer to the aggregated client updates, treating them as a pseudo-gradient. FedYogi applies the Yogi update rule to this pseudo-gradient, which helps it handle the variance and noise inherent in federated client updates. The server maintains adaptive per-parameter learning rates based on estimates of the first moment (mean) and second moment (uncentered variance) of the aggregated updates. Its key mechanism is a more conservative update to the second moment estimate, which prevents that estimate from shrinking rapidly, keeps the effective learning rate from spiking, and provides more stable convergence, especially when client gradients are noisy or sparse.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.