SCAFFOLD is a federated optimization algorithm that introduces control variates (server and client correction terms) to counteract client drift. This drift occurs when local models diverge from the global objective due to training on non-IID data. By estimating and subtracting the discrepancy between local and global update directions, SCAFFOLD keeps local steps aligned with the shared objective, which typically yields faster convergence than standard Federated Averaging (FedAvg) on heterogeneous data.

What is SCAFFOLD?
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is a foundational algorithm designed to correct client drift in federated learning, enabling faster and more stable convergence across statistically heterogeneous edge devices.
The algorithm maintains two kinds of state: a global control variate on the server and a local control variate on each client. During each round, clients compute updates relative to these correction terms, which are then aggregated. This mechanism reduces the variance in client updates caused by data heterogeneity. SCAFFOLD is most effective under high statistical heterogeneity, particularly in cross-silo settings where clients can keep their control variates across rounds, and it is closely related to variance reduction techniques such as Federated SVRG.
Key Features of SCAFFOLD
SCAFFOLD is designed to correct for client drift in heterogeneous data environments. Its core innovation is the use of control variates to align local and global optimization objectives.
Control Variates for Client Drift Correction
The central mechanism of SCAFFOLD is the use of control variates: vectors stored on both the server and each client. The difference between the server and client variates estimates how far the client's local gradient direction deviates from the global gradient direction.
- Client Control Variate (c_i): Estimates the gradient direction of client i's local objective, and so captures the bias of client i's local data distribution.
- Server Control Variate (c): Estimates the global gradient direction, maintained as the average of the client control variates.
During local training, the client's gradient is corrected by subtracting its local bias (c_i) and adding the global direction (c). This explicitly counteracts the client drift caused by non-IID data, guiding local updates toward the global optimum.
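The correction described above amounts to a single line of vector arithmetic. The following is a minimal sketch, assuming NumPy arrays for all vectors; the function name `corrected_local_step` and the toy values are illustrative and not taken from any particular library.

```python
import numpy as np

def corrected_local_step(w, grad, c_i, c, lr):
    """One local SGD step using the drift-corrected gradient: grad - c_i + c."""
    return w - lr * (grad - c_i + c)

# Toy values, chosen only to make the arithmetic easy to follow.
w = np.array([1.0, -2.0])        # current local model
grad = np.array([0.5, 0.5])      # client's stochastic gradient at w
c_i = np.array([0.4, 0.1])       # client control variate (local direction estimate)
c = np.array([0.1, 0.3])         # server control variate (global direction estimate)

w_next = corrected_local_step(w, grad, c_i, c, lr=0.1)
print(w_next)
```

When c_i equals c (no estimated bias), the step reduces to plain SGD.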
Two-Way Synchronization Protocol
SCAFFOLD requires a bidirectional exchange of control variates, not just model weights. Each communication round involves:
- Server-to-Client: The server sends the global model w and the global control variate c.
- Local Correction: The client performs SGD on its local loss, but uses the corrected gradient: gradient - c_i + c.
- Client-to-Server: The client sends back its model update Δw_i and an update to its control variate Δc_i.
- Server Aggregation: The server averages the model updates and the control variate updates to produce new global states w and c.
This protocol ensures both the model and the estimate of client bias are continuously refined.
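The four steps above can be sketched end to end on a toy problem. This is a hedged illustration, not a reference implementation. It assumes full client participation, exact (noise-free) gradients, a server step size of 1, and two clients with quadratic losses 0.5 * a_i * ||w - b_i||^2. The control variate refresh c_i_new = c_i - c + (x - y) / (K * lr) is one of the update options described in the original SCAFFOLD paper; the function names and constants are hypothetical.

```python
import numpy as np

def client_round(x, c, c_i, grad_fn, lr, num_steps):
    """Run corrected local steps from the global model x; return the deltas to upload."""
    y = x.copy()
    for _ in range(num_steps):
        y -= lr * (grad_fn(y) - c_i + c)           # corrected gradient step
    # Refresh the client control variate from the net progress made this round.
    c_i_new = c_i - c + (x - y) / (num_steps * lr)
    return y - x, c_i_new - c_i, c_i_new           # (delta_w, delta_c, new local state)

def server_aggregate(x, c, deltas_w, deltas_c, num_clients_total, global_lr=1.0):
    """Average model deltas; move c by the participation-weighted mean of delta_c."""
    x_new = x + global_lr * np.mean(deltas_w, axis=0)
    c_new = c + (len(deltas_c) / num_clients_total) * np.mean(deltas_c, axis=0)
    return x_new, c_new

# Two heterogeneous clients (assumed toy data): different curvature and different minima.
a = [1.0, 3.0]
b = [np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
grad_fns = [lambda w, a_i=a_i, b_i=b_i: a_i * (w - b_i) for a_i, b_i in zip(a, b)]

x = np.zeros(2)                                    # global model
c = np.zeros(2)                                    # server control variate
c_locals = [np.zeros(2) for _ in a]                # per-client control variates

for _ in range(60):
    results = [client_round(x, c, c_locals[i], grad_fns[i], lr=0.1, num_steps=5)
               for i in range(len(a))]
    c_locals = [r[2] for r in results]
    x, c = server_aggregate(x, c, [r[0] for r in results], [r[1] for r in results],
                            num_clients_total=len(a))

# Minimizer of the average loss: curvature-weighted mean of the client minima.
w_star = sum(a_i * b_i for a_i, b_i in zip(a, b)) / sum(a)
print(x, w_star)
```

In this setup the global model settles on the minimizer of the average loss even though each client takes several local steps per round.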
Theoretical Convergence Guarantees
SCAFFOLD provides strong theoretical convergence rates that are independent of data heterogeneity (client drift).
- For Smooth Non-Convex Problems: In the original analysis, the number of communication rounds needed to drive the squared gradient norm below ε scales as O(σ²/(S·K·ε²) + (N/S)^(2/3)/ε), where S is the number of clients sampled per round, N is the total number of clients, K is the number of local steps, and σ² is the variance of the stochastic gradients. The bound contains no gradient-dissimilarity term, unlike the bounds for standard Federated Averaging (FedAvg), whose convergence can degrade severely with high data variance across clients.
- Key Insight: By using control variates to reduce the variance in client updates, SCAFFOLD effectively transforms the heterogeneous federated problem into a more homogeneous optimization task, enabling more local steps per round without causing divergence.
Comparison to FedAvg and FedProx
SCAFFOLD addresses the same core problem as FedProx—statistical heterogeneity—but with a fundamentally different, additive correction mechanism.
- vs. FedAvg: FedAvg has no explicit correction for client drift. Under high data heterogeneity, local models diverge, leading to slow, unstable convergence. SCAFFOLD's control variates actively correct this drift.
- vs. FedProx: FedProx adds a proximal term (μ/2 * ||w - w^t||^2) to the local objective, which acts as a soft penalty to prevent the client model from straying too far from the global model. SCAFFOLD, conversely, uses an additive correction (- c_i + c) to the gradient itself, directly steering the update direction. In practice, SCAFFOLD often achieves faster convergence than FedProx.
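The difference between the two mechanisms is easiest to see in the update direction each one produces. The sketch below is illustrative only: the function names and toy vectors are assumptions, and `w_global` stands for the round's global model w^t.

```python
import numpy as np

def fedprox_direction(grad, w, w_global, mu):
    """Gradient of the local loss plus the proximal penalty (mu/2) * ||w - w_global||^2."""
    return grad + mu * (w - w_global)

def scaffold_direction(grad, c_i, c):
    """Local gradient with the additive control variate correction."""
    return grad - c_i + c

grad = np.array([1.0, -1.0])     # client gradient, assumed biased by local data
w_global = np.zeros(2)

# At the start of local training w == w_global, so the proximal term is inactive
# and FedProx follows the raw (biased) gradient.
prox_dir = fedprox_direction(grad, w=w_global, w_global=w_global, mu=0.1)

# SCAFFOLD corrects from the first step: if c_i matches the client's gradient and
# the global estimate c is zero, the corrected direction vanishes.
scaf_dir = scaffold_direction(grad, c_i=np.array([1.0, -1.0]), c=np.zeros(2))
print(prox_dir, scaf_dir)
```

The proximal penalty only pulls back once the local model has already moved, whereas the control variate correction changes the direction of every step.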
Practical Considerations and Overhead
Implementing SCAFFOLD introduces specific system trade-offs:
- Communication Overhead: Doubles the per-client communication cost, as both model updates (Δw_i) and control variate updates (Δc_i) must be transmitted. This is a key consideration when weighing SCAFFOLD against communication-reduction techniques such as gradient compression.
- Client State: Each client must persistently store its local control variate c_i across rounds. This requires stable client participation or a state recovery mechanism for clients that drop out.
- Computation: The local correction step is computationally trivial (vector addition), adding negligible overhead compared to the forward/backward passes of the model itself.
- Use Case: The algorithm is most beneficial in cross-silo federated learning (e.g., between hospitals or banks) where data is highly heterogeneous, client participation is stable, and the communication overhead is acceptable relative to the convergence speed gains.
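The communication trade-off can be estimated with simple arithmetic. The helper below is a rough sketch, assuming uncompressed dense vectors and 4-byte (float32) parameters; the function name and the one-million-parameter model are hypothetical.

```python
def payload_bytes(num_params, algorithm, bytes_per_param=4):
    """Per-client bytes sent in each direction for one round (uncompressed)."""
    # FedAvg moves one model-sized vector each way; SCAFFOLD moves two
    # (model + control variate down, delta_w + delta_c up).
    vectors = {"fedavg": 1, "scaffold": 2}[algorithm]
    one_way = vectors * num_params * bytes_per_param
    return {"uplink": one_way, "downlink": one_way}

fedavg_cost = payload_bytes(1_000_000, "fedavg")
scaffold_cost = payload_bytes(1_000_000, "scaffold")
print(fedavg_cost, scaffold_cost)
```

Whether the doubled payload is acceptable depends on how many rounds the faster convergence saves in a given deployment.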
Relation to Variance Reduction Methods
SCAFFOLD is conceptually linked to classic variance reduction techniques from centralized optimization, such as SVRG (Stochastic Variance Reduced Gradient).
- Shared Principle: Both methods use a stored reference point (a full gradient in SVRG, the server control variate c in SCAFFOLD) to correct the variance of stochastic updates.
- Federated Adaptation: SCAFFOLD adapts this principle to the federated constraint where the true global gradient is never computed. The server control variate c serves as a running estimate. Federated SVRG is a related approach but often requires periodic computation of a full gradient across all clients, which is impractical in true federated settings. SCAFFOLD's design is more communication-efficient for ongoing federated training.
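Writing the two update rules side by side makes the correspondence explicit. The notation here is an assumption made for illustration: in SVRG, \tilde{w} is the snapshot point, j_t is the data sample drawn at step t, and \nabla f(\tilde{w}) is the full gradient at the snapshot; in SCAFFOLD, y_i is client i's local model, g_i is its stochastic gradient, and \eta_l is the local step size.

```latex
\begin{align*}
\text{SVRG:} \quad
  & w_{t+1} = w_t - \eta \left( \nabla f_{j_t}(w_t) - \nabla f_{j_t}(\tilde{w}) + \nabla f(\tilde{w}) \right) \\
\text{SCAFFOLD:} \quad
  & y_i \leftarrow y_i - \eta_l \left( g_i(y_i) - c_i + c \right)
\end{align*}
```

The client control variate c_i plays the role of the per-sample snapshot gradient, and the server control variate c plays the role of the full snapshot gradient, except that both are running estimates rather than exact quantities.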
SCAFFOLD vs. Federated Averaging (FedAvg)
A technical comparison of the SCAFFOLD optimization algorithm against the foundational Federated Averaging (FedAvg) method, highlighting mechanisms for handling data heterogeneity.
| Feature / Mechanism | SCAFFOLD (Stochastic Controlled Averaging) | Federated Averaging (FedAvg) |
|---|---|---|
| Core Innovation | Uses control variates (c_i, c) to correct for client drift | Simple weighted averaging of client model updates |
| Primary Objective | Mitigate client drift caused by data heterogeneity (non-IID data) | Enable collaborative training via periodic model averaging |
| Client-Side State | Maintains a personal control variate (c_i) and the global control variate (c) | Maintains only the local model parameters (w_i) |
| Client Update Computation | Δw_i = -η_l (g_i - c_i + c); where g_i is the local stochastic gradient | Δw_i = -η_l * g_i; standard SGD step |
| Server Aggregation Method | Averages model deltas (Δw_i) and control variate deltas (Δc_i) | Averages model parameters (w_i), weighted by local dataset size |
| Communication Overhead per Round | Transmits both model delta (Δw_i) and control variate delta (Δc_i) | Transmits only the updated model parameters (w_i) |
| Convergence Speed on Non-IID Data | Provably faster; O(1/ε) communication rounds to reach accuracy ε | Slower; can require significantly more rounds under high heterogeneity |
| Theoretical Guarantees | Convergence proven for non-convex objectives under client sampling | Convergence proven under assumptions of IID or bounded heterogeneity |
| Handling of Client Sampling | Robust; control variates correct bias from partial client participation | Sensitive; partial participation can introduce bias and slow convergence |
| Typical Use Case | Environments with high statistical heterogeneity (e.g., different user behavior) | Environments with relatively homogeneous data or where simplicity is key |
Frequently Asked Questions
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is a pivotal algorithm designed to overcome the fundamental challenge of client drift in federated optimization. These questions address its core mechanics, advantages, and practical implementation.
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is a federated optimization algorithm that uses control variates—client-specific and server-side correction terms—to counteract client drift caused by data heterogeneity. It works by having each client maintain a local control variate that estimates the bias between its local stochastic gradient and the true global gradient direction. During each round, clients perform local Stochastic Gradient Descent (SGD) but correct their updates using the difference between their local control variate and a global server control variate. The server then aggregates these corrected updates and updates the global control variate, effectively reducing the variance in the update direction and aligning client optimizations with the global objective.
Related Terms
SCAFFOLD operates within a broader ecosystem of algorithms designed to solve the core challenges of federated optimization. These related terms define the specific problems SCAFFOLD addresses and the alternative methodological approaches.
Client Drift
Client drift is the core problem SCAFFOLD is designed to solve. It is the phenomenon where local client models diverge from the global objective during multiple steps of Local SGD on statistically heterogeneous (non-IID) data. This divergence causes the aggregated global model to converge slowly or to a suboptimal solution. SCAFFOLD corrects for this drift using control variates.
- Cause: Performing many local epochs on data that is not representative of the global distribution.
- Effect: High variance in client updates, leading to unstable and slow global convergence.
- SCAFFOLD's Solution: Maintains a control variate for both server and clients to estimate and correct the update direction bias.
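Client drift can be reproduced on a very small example. The sketch below runs plain FedAvg with exact gradients on two scalar quadratic clients, 0.5 * a_i * (w - b_i)^2. The losses, step size, and function name are assumptions chosen for illustration. With one local step per round FedAvg reaches the true optimum, but with several local steps it settles at a biased point.

```python
import numpy as np

def fedavg(a, b, lr, num_steps, num_rounds):
    """Plain FedAvg with exact gradients on scalar quadratics 0.5 * a_i * (w - b_i)^2."""
    x = 0.0
    for _ in range(num_rounds):
        local_models = []
        for a_i, b_i in zip(a, b):
            y = x
            for _ in range(num_steps):
                y -= lr * a_i * (y - b_i)      # uncorrected local gradient step
            local_models.append(y)
        x = float(np.mean(local_models))       # server averages the local models
    return x

a, b = [1.0, 3.0], [1.0, -1.0]                 # heterogeneous curvature and minima
w_star = sum(a_i * b_i for a_i, b_i in zip(a, b)) / sum(a)   # true optimum

one_step = fedavg(a, b, lr=0.1, num_steps=1, num_rounds=200)
five_steps = fedavg(a, b, lr=0.1, num_steps=5, num_rounds=200)
print(w_star, one_step, five_steps)
```

The gap between the five-step result and the true optimum is the drift that control variates are meant to remove.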
Federated Variance Reduction
Federated Variance Reduction is a class of techniques adapted from classical optimization (e.g., SVRG, SAGA) to reduce the variance of stochastic gradients in the federated setting. High variance, exacerbated by non-IID data, is a primary cause of slow convergence. SCAFFOLD is a prominent federated variance reduction method.
- Goal: Stabilize the optimization trajectory by reducing the noise in update directions.
- Classical Analogy: Similar to how SVRG uses a full-batch gradient snapshot as a control variate.
- SCAFFOLD's Approach: Employs personalized control variates stored on the server and each client to correct local gradient estimates, effectively reducing the variance introduced by data heterogeneity.
FedProx
FedProx is a federated optimization algorithm that addresses statistical and systems heterogeneity by adding a proximal term to the local client's objective function. This term penalizes the local model for deviating too far from the global model, directly combating client drift.
- Mechanism: Clients minimize Local Loss + (μ/2) * ||local_model - global_model||^2.
- Comparison to SCAFFOLD: Both mitigate client drift. FedProx uses a penalty-based method (proximal term), while SCAFFOLD uses a correction-based method (control variates).
- Use Case: Particularly effective when client capabilities vary widely (systems heterogeneity), because the proximal term keeps updates close to the global model even when clients complete different amounts of local work.
Adaptive Federated Optimization (FedOpt)
Adaptive Federated Optimization (FedOpt) is a framework that generalizes the server-side aggregation step. Instead of simple averaging (FedAvg), it applies adaptive optimizer algorithms like Adam, Yogi, or Adagrad to the stream of client updates.
- Core Idea: Treat the aggregated client update as a pseudo-gradient and apply an adaptive optimizer on the server.
- FedAdam/FedYogi: Specific instantiations of FedOpt using Adam or Yogi.
- Relation to SCAFFOLD: SCAFFOLD and FedOpt are orthogonal and complementary. SCAFFOLD corrects the client-side updates using control variates, while FedOpt improves the server-side aggregation. They can be combined for further performance gains.
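The server-side idea can be sketched as an Adam-style optimizer applied to the averaged client delta. This is a minimal illustration in the style of FedAdam, assuming no bias correction and using hyperparameter values picked only for the example; the class name `ServerAdam` is hypothetical.

```python
import numpy as np

class ServerAdam:
    """Adam-style server optimizer that treats the mean client delta as a pseudo-gradient."""

    def __init__(self, dim, lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
        self.lr, self.beta1, self.beta2, self.tau = lr, beta1, beta2, tau
        self.m = np.zeros(dim)    # first-moment estimate
        self.v = np.zeros(dim)    # second-moment estimate

    def step(self, x, client_deltas):
        delta = np.mean(client_deltas, axis=0)                      # pseudo-gradient
        self.m = self.beta1 * self.m + (1 - self.beta1) * delta
        self.v = self.beta2 * self.v + (1 - self.beta2) * delta**2
        # Client deltas already point downhill, so the server adds the scaled moment.
        return x + self.lr * self.m / (np.sqrt(self.v) + self.tau)

server = ServerAdam(dim=2)
x = np.zeros(2)
client_deltas = [np.array([1.0, -1.0]), np.array([3.0, -1.0])]
x_new = server.step(x, client_deltas)
print(x_new)
```

Because this step only consumes the aggregated deltas, it is agnostic to whether clients produced them with plain local SGD or with control-variate-corrected steps.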
Local Stochastic Gradient Descent (Local SGD)
Local SGD is the fundamental client-side training procedure in federated learning. Each selected device performs multiple iterations of Stochastic Gradient Descent (often several epochs) on its local dataset before sending its model update to the server. This is the 'local' part of Federated Averaging (FedAvg).
- Key Parameter: Number of local epochs or local steps. More steps increase computation but also exacerbate client drift.
- SCAFFOLD's Interaction: SCAFFOLD modifies the standard Local SGD update rule. The client's gradient is corrected by the difference between its personal control variate and the server's control variate before applying the update.
- Formula (SCAFFOLD Client Update): θ_client = θ_client - η * (gradient + c_server - c_client).
Federated Learning with Non-IID Data
This is the primary challenge setting for algorithms like SCAFFOLD. Non-IID data (data that is not independent and identically distributed) refers to statistical heterogeneity in which the data distribution differs significantly across clients (e.g., different user writing styles, regional image types). This breaks the core assumptions of centralized SGD.
- Manifestations: Label distribution skew, feature distribution skew, or quantity skew.
- Consequences: Client drift, model bias, and slow/unstable convergence.
- Algorithmic Responses: SCAFFOLD, FedProx, and personalized FL methods are all designed to maintain performance under non-IID data conditions, which is the realistic default in federated systems.
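Label distribution skew is often simulated in experiments by splitting a dataset across clients with Dirichlet-distributed class proportions. That simulation technique is not described above and is included here only as an assumed, commonly used way to produce non-IID test conditions. The function name and parameters are illustrative; a smaller `alpha` yields more skewed clients.

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients with per-class Dirichlet(alpha) proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Share of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, shard in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices

# Toy dataset: 10 classes with 100 samples each (labels only).
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_label_partition(labels, num_clients=5, alpha=0.1)
for client_id, idx in enumerate(parts):
    print(client_id, np.bincount(labels[idx], minlength=10))
```

Every sample is assigned to exactly one client, so the partition can be used directly to build per-client training sets.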

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.