SCAFFOLD is a federated optimization algorithm that uses control variates—client-specific and server-side correction terms—to reduce the variance in local stochastic gradient updates. This directly counteracts client drift, the divergence of local models caused by statistical heterogeneity across devices. By incorporating these corrections, SCAFFOLD ensures local updates are consistently aligned with the global objective, leading to faster and more stable convergence compared to basic algorithms like Federated Averaging (FedAvg).
SCAFFOLD

What is SCAFFOLD?
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is a foundational algorithm designed to correct for client drift in federated learning systems with non-IID data.
The algorithm maintains two sets of control variates: one on the server representing the global update direction, and one per client capturing its local data bias. During each communication round, clients adjust their gradients using the difference between these terms. This mechanism is particularly effective in cross-device FL scenarios with high data skew. SCAFFOLD's design addresses a core federated optimization challenge without relying on restrictive client-side constraints, making it a pivotal technique for robust on-device learning systems.
Key Features of SCAFFOLD
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is an advanced federated optimization algorithm designed to correct for client drift caused by data heterogeneity. Its core innovation is the use of control variates—correction terms stored on both the server and clients—to reduce the variance in client updates.
Control Variates for Variance Reduction
The central mechanism of SCAFFOLD is the introduction of control variates, which are correction terms that estimate the update direction of the global model. Each client maintains a local control variate (c_i), and the server maintains a global control variate (c). During local training, the client's update is adjusted by the difference between these terms (c - c_i), effectively subtracting the estimated client-specific bias. This variance reduction technique ensures client updates are more aligned with the global objective, dramatically improving convergence stability on non-IID data.
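The corrected local step described above can be sketched in a few lines of plain Python. The function name `corrected_step`, the learning rate `lr`, and the toy vectors are illustrative assumptions and do not come from any particular library.

```python
def corrected_step(y, grad, c_local, c_global, lr):
    """One SCAFFOLD-style local step: y <- y - lr * (grad - c_i + c)."""
    return [
        y_k - lr * (g_k - ci_k + c_k)
        for y_k, g_k, ci_k, c_k in zip(y, grad, c_local, c_global)
    ]

# Toy values (hypothetical): the client's gradient is skewed in the first
# coordinate, and its control variate c_i has already learned part of that skew.
y = [0.0, 0.0]
grad = [1.5, -0.5]      # local stochastic gradient at y
c_local = [1.0, 0.0]    # c_i: estimate of this client's own gradient direction
c_global = [0.0, 0.0]   # c: estimate of the average direction over all clients

y_next = corrected_step(y, grad, c_local, c_global, lr=0.1)
```

Here the correction `c - c_i` removes the part of the gradient that `c_i` attributes to this client's own data, so the first coordinate moves by 0.05 rather than 0.15.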
Mitigation of Client Drift
SCAFFOLD directly addresses the client drift problem, where models trained on statistically heterogeneous local data diverge from the optimal global solution. By using control variates to correct the local descent direction, SCAFFOLD keeps each client from drifting toward the optimum of its own local data distribution. With this correction, the aggregated update stays close to the true global gradient direction even after many local steps, unlike basic Federated Averaging (FedAvg), which suffers significant drift under high heterogeneity.
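Client drift can be reproduced on a toy problem. The sketch below runs plain FedAvg on two hypothetical one-dimensional clients with objectives f_i(x) = 0.5 * a_i * (x - b_i)^2, using exact gradients so that any error comes from heterogeneity rather than sampling noise. The curvatures, minimizers, and step counts are assumptions picked for illustration.

```python
def fedavg(curvatures, targets, local_steps, rounds, lr=0.1):
    """FedAvg on f_i(x) = 0.5 * a_i * (x - b_i)**2 with exact gradients."""
    x = 0.0
    for _ in range(rounds):
        local_models = []
        for a, b in zip(curvatures, targets):
            y = x                            # each client starts from the global model
            for _ in range(local_steps):
                y -= lr * a * (y - b)        # plain, uncorrected local gradient step
            local_models.append(y)
        x = sum(local_models) / len(local_models)   # server averages the local models
    return x

a = [1.0, 3.0]   # per-client curvature (hypothetical)
b = [0.0, 4.0]   # per-client minimizer (hypothetical)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)   # minimizer of the average objective

x_one_step = fedavg(a, b, local_steps=1, rounds=200)
x_ten_steps = fedavg(a, b, local_steps=10, rounds=200)
```

With one local step per round the average model reaches the global minimizer `x_star`. With ten local steps each client moves most of the way to its own minimizer before averaging, so the server settles at a point between the plain mean of the `b` values and `x_star`, which is the drift this section describes.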
Server and Client State Synchronization
SCAFFOLD requires maintaining and synchronizing state between the server and clients. The algorithm's steps are:
- Server Initialization: The server initializes the global model and global control variate c.
- Client Update: Selected clients receive the global model and c. They perform local SGD, using their local control variate c_i to correct gradients, then send back the model delta and the change in c_i.
- Server Aggregation: The server aggregates the model deltas and shifts the global control variate c by the received changes in c_i, averaged over all clients, so that c tracks the average of the c_i.
This synchronized state management is crucial for the algorithm's corrective effect.
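One round of these steps can be sketched end to end on a toy problem. The code below assumes two hypothetical one-dimensional clients with objectives f_i(x) = 0.5 * a_i * (x - b_i)^2, exact gradients, full participation, and a global step size of 1. The control variate refresh uses the variant from the SCAFFOLD paper that derives the new c_i from the net local movement; all names and numbers are illustrative.

```python
def scaffold_round(x, c, c_locals, curvatures, targets, local_steps, lr):
    """One full-participation SCAFFOLD round on f_i(x) = 0.5*a_i*(x - b_i)**2."""
    n = len(c_locals)
    delta_y_sum, delta_c_sum = 0.0, 0.0
    new_c_locals = []
    for a, b, c_i in zip(curvatures, targets, c_locals):
        y = x                                   # client starts from the global model
        for _ in range(local_steps):
            grad = a * (y - b)                  # exact local gradient (no sampling noise)
            y -= lr * (grad - c_i + c)          # corrected local step
        # Refresh c_i from the net local movement (needs no extra gradient pass).
        c_i_new = c_i - c + (x - y) / (local_steps * lr)
        delta_y_sum += y - x                    # model delta sent to the server
        delta_c_sum += c_i_new - c_i            # control variate delta sent to the server
        new_c_locals.append(c_i_new)
    x_new = x + delta_y_sum / n                 # global step size of 1 (assumption)
    c_new = c + delta_c_sum / n                 # n counts all clients; all took part
    return x_new, c_new, new_c_locals

a = [1.0, 3.0]        # hypothetical per-client curvature
b = [0.0, 4.0]        # hypothetical per-client minimizer
x, c, c_locals = 0.0, 0.0, [0.0, 0.0]
for _ in range(100):
    x, c, c_locals = scaffold_round(x, c, c_locals, a, b, local_steps=10, lr=0.1)
```

On this toy problem the global model settles at the minimizer of the average objective (3.0) even with ten local steps per round, and each `c_i` settles at that client's gradient at the optimum. Partial participation, stochastic gradients, and a tuned global step size are left out of the sketch.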
Advantages Over FedAvg and FedProx
SCAFFOLD provides theoretical and practical improvements over foundational algorithms:
- Vs. FedAvg: Provides provably faster convergence, especially under high data heterogeneity, by correcting for client drift rather than simply averaging potentially divergent models.
- Vs. FedProx: While FedProx adds a proximal term to constrain updates, SCAFFOLD uses an additive correction. SCAFFOLD often achieves better convergence rates and final accuracy without needing to tune a penalty hyperparameter. It is particularly effective when clients perform many local steps.
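The difference between the two corrections is easiest to see in the local update direction. The sketch below is an illustration, not either algorithm's reference implementation: `fedprox_direction` is the gradient of the proximal objective f_i(y) + (mu / 2) * ||y - x||^2, and `scaffold_direction` applies the additive control variate correction. The helper names, `mu`, and the toy vectors are assumptions.

```python
def fedprox_direction(grad, y, x_global, mu):
    """Gradient of f_i(y) + (mu / 2) * ||y - x_global||^2."""
    return [g + mu * (yk - xk) for g, yk, xk in zip(grad, y, x_global)]

def scaffold_direction(grad, c_local, c_global):
    """Local gradient with the additive control variate correction."""
    return [g - ci + c for g, ci, c in zip(grad, c_local, c_global)]

grad = [2.0, -1.0]          # hypothetical local gradient
x_global = [0.0, 0.0]       # global model received at the start of the round

# At the start of local training y == x_global, so the proximal pull is zero.
prox_at_start = fedprox_direction(grad, x_global, x_global, mu=0.5)
# After the local model has moved away, the pull grows with the distance.
prox_later = fedprox_direction(grad, [1.0, 1.0], x_global, mu=0.5)
# The SCAFFOLD correction does not depend on distance and has no penalty weight.
scaf = scaffold_direction(grad, c_local=[1.5, -1.0], c_global=[0.5, 0.0])
```

The proximal term only pulls the local model back toward the global model and needs `mu` to be chosen, whereas the control variate term changes the direction of every local step by the same amount.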
Application in On-Device Learning
SCAFFOLD's design is highly relevant for on-device learning scenarios on microcontrollers and edge devices. Its ability to converge efficiently with fewer communication rounds is critical for battery-powered, intermittently connected devices. The local control variate (c_i), a vector the same size as the model, summarizes the gradient direction induced by the device's own data, so local training can adapt to that data while still contributing to a robust global model. This makes it a candidate algorithm for federated edge learning systems where data privacy and communication efficiency are paramount.
Computational and Communication Overhead
The improved convergence of SCAFFOLD comes with trade-offs:
- Increased Client Memory: Clients must store their local control variate c_i, which is the same size as the model's gradient vector, doubling the client-side state.
- Increased Communication: Clients must transmit both the model update and their updated control variate c_i to the server, increasing per-round communication cost by approximately 2x compared to sending only model weights.
- Server Complexity: The server must maintain and update the global control variate c.
This overhead is often justified by the significant reduction in the number of communication rounds required to reach a target accuracy.
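The communication trade-off can be put into numbers with a small back-of-the-envelope calculation. The model size and float32 encoding below are hypothetical, and the function counts payload bytes only.

```python
def upload_bytes_per_round(num_params, bytes_per_param=4, sends_control_variate=False):
    """Bytes one client uploads per round (payload only, no protocol framing)."""
    vectors = 2 if sends_control_variate else 1   # model delta (+ control variate delta)
    return vectors * num_params * bytes_per_param

num_params = 25_000   # hypothetical model: 25k float32 weights = 100 KB
fedavg_up = upload_bytes_per_round(num_params)
scaffold_up = upload_bytes_per_round(num_params, sends_control_variate=True)

# SCAFFOLD uploads less in total only if it needs fewer than this fraction
# of the rounds FedAvg needs to reach the same accuracy.
break_even_round_fraction = fedavg_up / scaffold_up
```

Under these assumptions the break-even point is one half: the doubled upload pays off in total traffic only when SCAFFOLD reaches the target accuracy in fewer than half the rounds.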
SCAFFOLD vs. FedAvg vs. FedProx
A technical comparison of core federated learning algorithms designed to address the challenges of statistical heterogeneity (non-IID data) and client drift.
| Algorithmic Feature / Mechanism | SCAFFOLD (Stochastic Controlled Averaging) | FedAvg (Federated Averaging) | FedProx |
|---|---|---|---|
| Primary Innovation | Control variates (client & server correction terms) to reduce update variance | Weighted averaging of client model parameters after local SGD | Proximal term added to local objective to constrain client drift |
| Core Objective | Correct for client drift by estimating update direction bias | Minimize communication cost via multiple local epochs | Handle system & statistical heterogeneity via constrained optimization |
| Key Mechanism for Heterogeneity | Tracks and corrects the difference between client and server update directions | Relies on averaging; performance degrades significantly under high heterogeneity | Penalizes local updates that stray too far from the global model |
| Client-Server Communication | Client sends model update + control variate delta; Server maintains its own control variate | Client sends updated model parameters (or gradients); Server performs averaging | Client sends updated model parameters; Server performs averaging (identical to FedAvg) |
| Handles Non-IID Data | Excellent. Explicitly designed for and robust to high statistical heterogeneity. | Poor. Suffers from significant client drift and slow, unstable convergence. | Good. Proximal term mitigates drift, improving stability and convergence. |
| Convergence Guarantees | Strong theoretical convergence for both IID and non-IID data, independent of data heterogeneity. | Convergence guarantees typically assume IID or bounded dissimilarity; weak under high heterogeneity. | Convergence guarantees with a dissimilarity measure; more robust than FedAvg under heterogeneity. |
| Client-Side Computation Overhead | Moderate. Requires storing and updating a personal control variate. | Low. Standard local SGD steps. | Low to Moderate. Requires computing the proximal term (L2 distance to global model). |
| Server-Side Computation Overhead | Moderate. Must maintain and update a server control variate. | Low. Simple weighted averaging. | Low. Simple weighted averaging (identical to FedAvg). |
| Privacy Implication | Control variates may potentially leak additional information about client update direction, though not raw data. | Standard FL privacy; relies on secure aggregation and DP for formal guarantees. | Standard FL privacy; identical to FedAvg. |
| Typical Use Case | Cross-silo and cross-device with severe data skew, where convergence quality is critical. | Cross-device with relatively homogeneous data or large number of participants (e.g., mobile keyboard). | Cross-device with system heterogeneity (stragglers) and moderate statistical heterogeneity. |
SCAFFOLD Use Cases in TinyML & On-Device Learning
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is a foundational algorithm designed to correct for client drift in heterogeneous data environments. Its core mechanism—using control variates—is uniquely suited to the constraints and challenges of microcontroller-based systems.
Mitigating Client Drift on Non-IID Sensor Data
SCAFFOLD's primary use case is correcting client drift caused by statistical heterogeneity (Non-IID data). On-device sensors (e.g., accelerometers, microphones) generate data with highly variable distributions per device. SCAFFOLD uses control variates—client-specific and server-specific correction terms—to estimate the update direction as if training on IID data. This is critical for applications like:
- Personalized activity recognition across diverse user populations.
- Industrial predictive maintenance where machine wear patterns differ per unit.
- Environmental monitoring with sensors in varied geographic locations.
Reducing Communication Rounds for Battery-Constrained Devices
A key advantage for TinyML is SCAFFOLD's faster convergence, which directly reduces communication rounds. Each transmission for model updates consumes significant energy on microcontroller units (MCUs). By providing a more accurate update direction, SCAFFOLD often reaches target accuracy in fewer rounds than algorithms like Federated Averaging (FedAvg). This translates to:
- Extended battery life for IoT and wearable devices.
- Lower bandwidth usage over constrained wireless links (e.g., LoRaWAN, BLE).
- Reduced server-side computational overhead for aggregation.
Enabling Stable On-Device Fine-Tuning
SCAFFOLD provides a stable foundation for on-device fine-tuning and continual learning. The control variates act as a memory of the global optimization state, preventing the local model from diverging too far during local adaptation. This is essential for:
- Personalizing a global wake-word model to a specific user's voice without corrupting the base model.
- Adapting a visual anomaly detector to new lighting conditions on a factory camera.
- Mitigating catastrophic forgetting when learning sequentially from local data streams.
Synergy with Differential Privacy for Sensitive Data
When combined with Differential Privacy (DP), SCAFFOLD can offer a favorable privacy-accuracy trade-off. Adding DP noise to client updates increases variance and hurts convergence. SCAFFOLD's variance reduction via control variates can partially compensate for this noise, allowing for a stronger privacy guarantee (smaller epsilon) for a given target model accuracy. This is vital for:
- Healthcare wearables processing physiological data.
- Smart home sensors analyzing private in-home activities.
- Any application requiring formal privacy guarantees under regulations like GDPR.
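The noise that DP adds to a client update typically comes from a clip-then-perturb step. The sketch below shows a generic Gaussian mechanism applied to one update vector; the function name, `clip_norm`, and `noise_multiplier` are assumptions, and it omits the privacy accounting a real deployment needs. In a SCAFFOLD setting the control variate delta is transmitted as well, so it would need comparable treatment, which this sketch does not cover.

```python
import math
import random

def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm <= clip_norm, then add Gaussian noise."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [u * scale for u in update]
    sigma = noise_multiplier * clip_norm          # noise std scales with the clip bound
    noisy = [u + rng.gauss(0.0, sigma) for u in clipped]
    return clipped, noisy

rng = random.Random(0)                            # fixed seed for a repeatable demo
clipped, noisy = clip_and_noise([3.0, 4.0], clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The clipped vector keeps the direction of the original update with its norm reduced to the bound; the added noise is the extra variance that the paragraph above says control variates can partially offset.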
Practical Implementation Constraints on MCUs
Deploying SCAFFOLD on MCUs requires careful engineering due to memory and compute limits.
- Memory Overhead: Persisting the local control variate c_i alongside the model parameters doubles the storage requirement compared to the model alone. For a 100KB model, this means ~200KB of persistent flash storage. The server control variate c is received each round and only needs working memory during local training.
- Compute Overhead: The update rule involves extra vector additions/subtractions. This cost is minimal compared to forward/backward passes, and on MCUs without a floating-point unit it can be implemented in fixed-point arithmetic.
- State Management: Control variates must be checkpointed reliably across power cycles. This necessitates robust embedded storage management.
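The extra arithmetic in the corrected update maps naturally onto integer operations. The sketch below models a fixed-point version of the step in Python; the Q-format with 12 fractional bits, the helper names, and the sample values are assumptions, and real firmware would be written in C with an intermediate type wide enough to hold the product.

```python
FRAC_BITS = 12                 # hypothetical Q-format: real value = integer / 2**12
ONE = 1 << FRAC_BITS

def to_fixed(value):
    """Convert a float to the fixed-point integer representation."""
    return int(round(value * ONE))

def fixed_corrected_step(y_q, g_q, ci_q, c_q, lr_q):
    """y <- y - lr * (g - c_i + c), using only integer add, multiply, and shift."""
    out = []
    for y, g, ci, c in zip(y_q, g_q, ci_q, c_q):
        corrected = g - ci + c                              # integer add/sub only
        out.append(y - ((lr_q * corrected) >> FRAC_BITS))   # rescale after the multiply
    return out

y_next_q = fixed_corrected_step(
    y_q=[to_fixed(0.5)],
    g_q=[to_fixed(0.25)],
    ci_q=[to_fixed(0.125)],
    c_q=[to_fixed(0.0)],
    lr_q=to_fixed(0.5),
)
y_next = y_next_q[0] / ONE     # back to float for inspection
```

With these values the result matches the floating-point computation 0.5 - 0.5 * (0.25 - 0.125) exactly, because every operand is representable in the chosen format; in general the shift introduces a small rounding error per step.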
Comparison to Related Federated Optimization Algorithms
SCAFFOLD addresses limitations of other common FL algorithms in a TinyML context:
- vs. FedAvg: FedAvg suffers significantly from client drift on heterogeneous data; SCAFFOLD explicitly corrects for it.
- vs. FedProx: FedProx adds a proximal term to limit update magnitude but doesn't correct direction. SCAFFOLD is more theoretically grounded for variance reduction.
- vs. Local SGD: SCAFFOLD can be viewed as a corrected version of Local SGD, where control variates compensate for the bias introduced by local steps.
The choice depends on the severity of data heterogeneity, device capabilities, and communication budget.
Frequently Asked Questions
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is an advanced algorithm designed to improve federated learning convergence in the presence of statistically heterogeneous (non-IID) client data. These questions address its core mechanisms, applications, and relationship to other federated learning techniques.
SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) is a federated optimization algorithm that uses control variates, client-specific and server-side correction terms, to reduce the variance in client updates and correct for client drift caused by data heterogeneity. It works by maintaining two sets of variables per client: the local model parameters and a local control variate that estimates the direction of the client's bias relative to the global objective. The server also maintains a global control variate. During each communication round, clients perform local Stochastic Gradient Descent (SGD) but adjust their gradient steps using the difference between the global and local control variates, effectively steering their updates toward the global optimum. After local training, clients send both their model updates and the changes to their local control variates to the server. The server averages the model updates as in Federated Averaging (FedAvg) and shifts the global control variate by the averaged changes so that it tracks the average of the local ones. This mechanism compensates for the statistical heterogeneity in non-IID data, leading to faster and more stable convergence compared to standard FedAvg.
Related Terms
SCAFFOLD operates within a specialized ecosystem of algorithms and techniques designed for decentralized, privacy-preserving model training and adaptation. These related concepts address the core challenges of data heterogeneity, communication efficiency, and secure aggregation.
Client Drift
A phenomenon where local client models, optimized on their heterogeneous data distributions, diverge from the global objective, hindering convergence of the federated model. Client drift is the primary problem SCAFFOLD is designed to solve.
- Cause: Statistical heterogeneity (non-IID data) across clients.
- Effect: Increased communication rounds, reduced final model accuracy.
- SCAFFOLD's Solution: Introduces control variates that estimate the update direction bias for each client and the server, applying a correction term to local updates to align them with the global objective.
Control Variate
A statistical technique used to reduce the variance of an estimator by using a correlated, known quantity. In SCAFFOLD, control variates are the core innovation: each client and the server maintain a vector that captures the bias in their stochastic gradient estimates.
- Client Control Variate: Estimates the client's own local gradient direction (c_i).
- Server Control Variate: Approximates the average gradient direction across clients (c).
- Function: The difference c - c_i estimates how far the client's direction deviates from the global one. Adding it to each local gradient step de-biases the step and reduces the variance introduced by data heterogeneity.
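The underlying statistical idea can be shown outside federated learning with a textbook Monte Carlo example. The sketch below estimates E[exp(U)] for U uniform on [0, 1] and uses U itself, whose mean 0.5 is known, as the control variate. It is an analogy for the role the control variates play in SCAFFOLD, not SCAFFOLD itself, and the sample size and seed are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(0)                       # fixed seed for a repeatable demo
u = [rng.random() for _ in range(10_000)]
plain = [math.exp(v) for v in u]             # naive samples of exp(U)

# Coefficient that minimizes variance: Cov(exp(U), U) / Var(U), estimated from the samples.
mean_u = statistics.fmean(u)
mean_p = statistics.fmean(plain)
cov = sum((p - mean_p) * (v - mean_u) for p, v in zip(plain, u)) / len(u)
beta = cov / statistics.pvariance(u)

# Subtract the correlated quantity, centered at its known mean of 0.5.
controlled = [p - beta * (v - 0.5) for p, v in zip(plain, u)]

var_plain = statistics.pvariance(plain)
var_controlled = statistics.pvariance(controlled)
estimate = statistics.fmean(controlled)      # estimate of E[exp(U)] = e - 1
```

Both sample sets estimate the same expectation, but the controlled samples have far lower variance because most of the fluctuation in exp(U) is explained by U.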
On-Device Fine-Tuning
The process of adapting a pre-trained model using local data directly on a constrained edge device or microcontroller. SCAFFOLD provides a framework for performing such adaptation in a federated context, where fine-tuning occurs locally and only corrective updates are shared.
- Use Case: Personalizing a global model for a specific sensor, user, or environment.
- SCAFFOLD's Role: Enables more stable and efficient personalized updates by correcting for drift, making it suitable for continual on-device learning scenarios where data streams are non-stationary.
Statistical Heterogeneity (Non-IID Data)
The defining characteristic of federated learning where local data distributions across clients are not independent and identically distributed. This data skew causes challenges like client drift and is the central condition SCAFFOLD is optimized for.
- Manifestations: Varying label distributions, feature distributions, or sample sizes per client.
- Impact: Degrades performance of naive averaging (FedAvg).
- SCAFFOLD's Advantage: Its control variate mechanism is explicitly derived to be robust to this heterogeneity, maintaining convergence guarantees where other algorithms fail.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.