Inferensys

Crash Fault Tolerance (CFT)

Crash Fault Tolerance (CFT) is the property of a distributed system to remain operational and consistent despite the failure of some components, assuming they fail by stopping (crashing).
AGENTIC ROLLBACK STRATEGIES

What is Crash Fault Tolerance (CFT)?

A core property of distributed systems enabling continued operation despite component failures.

Crash Fault Tolerance (CFT) is the property of a distributed system that guarantees continued correct operation and data consistency despite the failure of some of its components, under the assumption that failed components simply stop functioning (a 'crash-stop' failure) and do not produce malicious or arbitrary incorrect outputs. This is a foundational concept for building reliable services, forming the basis for consensus protocols like Raft and state machine replication. CFT systems achieve resilience through redundancy, using multiple replicas and coordinated checkpointing to ensure surviving nodes can maintain service and recover state.

In the context of agentic rollback strategies, CFT provides the underlying system-level guarantee that allows an autonomous agent's execution environment to remain stable. When an agent encounters a logical error and must execute a rollback protocol to a previous checkpoint, the CFT property of the supporting infrastructure ensures the rollback's target state is preserved and consistently available. CFT differs from the more stringent Byzantine Fault Tolerance (BFT), which also defends against arbitrary (Byzantine) failures. For most internal, trusted system components, CFT is sufficient and more efficient.

FOUNDATIONAL CONCEPTS

Key Characteristics of CFT Systems

Crash Fault Tolerance (CFT) is a fundamental property of distributed systems, enabling them to maintain consistency and liveness despite component failures. These systems operate under the fail-stop model, where faulty nodes simply cease functioning and do not produce arbitrary or malicious outputs.

01

Fail-Stop Failure Model

CFT systems are designed to handle fail-stop faults, where a component fails by halting completely. This is a critical simplifying assumption that distinguishes CFT from Byzantine Fault Tolerance (BFT).

  • Assumption: Failed nodes stop sending messages and do not corrupt data.
  • Implication: The system only needs to detect silence or crashes, not malicious behavior.
  • Contrast: BFT systems must handle arbitrary, potentially malicious faults, requiring more complex and expensive consensus protocols like Practical Byzantine Fault Tolerance (PBFT).
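Because a crashed node only goes silent, detection can be as simple as tracking when each node was last heard from. The following is a minimal sketch of that idea, not a production detector: the class name `HeartbeatFailureDetector`, the node IDs, and the 3-second timeout are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatFailureDetector:
    """Suspects a node has crashed once it has been silent longer than timeout_s."""

    timeout_s: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node_id: str, now: float) -> None:
        # Any message from a node is evidence it was alive at time `now`.
        self.last_seen[node_id] = now

    def suspected(self, now: float) -> set[str]:
        # Under the crash model, silence is the only symptom to look for.
        return {
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        }


detector = HeartbeatFailureDetector(timeout_s=3.0)
detector.record_heartbeat("node-a", now=0.0)
detector.record_heartbeat("node-b", now=0.0)
detector.record_heartbeat("node-a", now=4.0)  # node-b stays silent
print(detector.suspected(now=5.0))  # {'node-b'}
```

A timeout can only suspect a crash: a slow node and a dead node look the same from the outside, which is why the consensus layer, not the detector, is responsible for safety.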
02

Consensus for Safety

CFT systems rely on consensus algorithms to ensure all non-faulty replicas agree on the same sequence of state updates, guaranteeing safety (nothing bad happens).

  • Primary Mechanism: Protocols like Raft and Paxos are the industry standards for CFT consensus.
  • Quorum-Based Decisions: Operations are committed once a majority (quorum) of replicas acknowledge them. This ensures progress even if a minority of nodes crash.
  • Leader-Based Coordination: Typically, a single elected leader sequences commands, simplifying the replication log management. If the leader crashes, a new election is held.
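The quorum rule above can be illustrated with a small sketch of how a leader might decide which log entries are committed. The function names `majority` and `commit_index` and the replica names are hypothetical, and the sketch deliberately omits details that real protocols add (Raft, for instance, also checks the term of the entry before committing it).

```python
def majority(cluster_size: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return cluster_size // 2 + 1


def commit_index(acked_index: dict[str, int]) -> int:
    """Highest log index stored on a majority of replicas.

    acked_index maps each replica (leader included) to the last log index
    it has acknowledged storing.
    """
    ranked = sorted(acked_index.values(), reverse=True)
    # The value at position majority - 1 is held by at least `majority` replicas.
    return ranked[majority(len(ranked)) - 1]


acks = {"leader": 7, "r1": 7, "r2": 5, "r3": 3, "r4": 0}  # r4 has crashed
print(commit_index(acks))  # 5
```

Entries up to index 5 are stored on three of the five replicas, so they are committed even though one replica is down and another is lagging.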
03

Liveness via Redundancy

To remain operational (live) during crashes, CFT systems employ redundancy and automatic failover. The number of simultaneous crashes the system can survive is determined by its replication factor.

  • Replication: Data and computation are replicated across multiple, independent nodes.
  • Failover: Upon detecting a leader or node crash (e.g., via timeouts), the system automatically promotes a healthy replica to take over. This is central to active-passive or active-active high-availability architectures.
  • Trade-off: Increased replication improves availability but adds coordination overhead and resource cost.
04

State Machine Replication (SMR)

A core implementation pattern for CFT is State Machine Replication. Identical deterministic replicas start from the same state and apply the same sequence of commands in the same order.

  • Deterministic Execution: Given the same input log, each replica must produce identical state transitions and outputs. This is non-negotiable for correct rollback and recovery.
  • Log Replication: The consensus algorithm's primary job is to maintain a consistent, fault-tolerant replicated log of commands.
  • Recovery: A crashed and restarted replica can catch up by replaying the committed log from a checkpoint.
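As a rough illustration of the pattern, the sketch below applies one shared command log to two replicas and checks that they converge. The `KeyValueReplica` class and its two operations (`set`, `add`) are hypothetical stand-ins for a real state machine; the log is a plain Python list rather than the output of a consensus protocol.

```python
from dataclasses import dataclass, field

Command = tuple[str, str, int]  # (operation, key, amount)


@dataclass
class KeyValueReplica:
    """A deterministic state machine: state depends only on the commands applied."""

    state: dict[str, int] = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply(self, command: Command) -> None:
        op, key, amount = command
        if op == "set":
            self.state[key] = amount
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + amount
        else:
            raise ValueError(f"unknown operation: {op}")
        self.applied += 1

    def catch_up(self, log: list[Command]) -> None:
        # Replay only the entries this replica has not applied yet.
        for command in log[self.applied:]:
            self.apply(command)


log: list[Command] = [("set", "x", 1), ("add", "x", 4), ("set", "y", 2)]

healthy = KeyValueReplica()
healthy.catch_up(log)

lagging = KeyValueReplica()
lagging.catch_up(log[:1])  # saw only the first entry before going offline
lagging.catch_up(log)      # on restart, replays the remaining entries

print(healthy.state == lagging.state)  # True
```

The replicas agree only because `apply` is deterministic; anything that reads a clock, a random source, or local configuration inside `apply` would break that guarantee.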
05

Checkpointing & Log Truncation

To enable efficient recovery and prevent unbounded log growth, CFT systems use checkpointing.

  • Periodic Snapshots: The system's full state is serialized to stable storage at intervals.
  • Log Compaction: Once a checkpoint is persisted, all log entries preceding it can be safely deleted. This process is called log truncation or compaction.
  • Fast Recovery: A newly started replica can load the latest checkpoint and only replay log entries created after that snapshot, drastically reducing recovery time.
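The interaction between snapshots, truncation, and recovery can be sketched with a toy counter. Everything here is an assumption made for illustration: the `CheckpointedCounter` class, the use of a JSON string as a stand-in for stable storage, and 1-based log indexes.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CheckpointedCounter:
    """Counter state machine whose log entries are integer increments."""

    value: int = 0
    last_applied: int = 0  # index of the last applied entry (1-based)
    log: dict[int, int] = field(default_factory=dict)  # index -> increment

    def append_and_apply(self, increment: int) -> None:
        self.last_applied += 1
        self.log[self.last_applied] = increment
        self.value += increment

    def checkpoint(self) -> str:
        # Serialize the full state, then drop every entry the snapshot covers.
        snapshot = json.dumps({"value": self.value, "last_applied": self.last_applied})
        self.log = {i: inc for i, inc in self.log.items() if i > self.last_applied}
        return snapshot

    @classmethod
    def recover(cls, snapshot: str, log_suffix: dict[int, int]) -> "CheckpointedCounter":
        saved = json.loads(snapshot)
        replica = cls(value=saved["value"], last_applied=saved["last_applied"])
        # Replay only entries newer than the snapshot, in index order.
        for index in sorted(log_suffix):
            if index > replica.last_applied:
                replica.value += log_suffix[index]
                replica.last_applied = index
                replica.log[index] = log_suffix[index]
        return replica


node = CheckpointedCounter()
for inc in (5, 10, 20):
    node.append_and_apply(inc)
snapshot = node.checkpoint()  # covers entries 1..3; the log is now empty
node.append_and_apply(7)      # entry 4 exists only in the log

recovered = CheckpointedCounter.recover(snapshot, dict(node.log))
print(recovered.value, recovered.last_applied)  # 42 4
```

Recovery touches one snapshot and one log entry instead of four entries; the saving grows with the length of the history the snapshot replaces.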
06

System Model & Limits

CFT operates within a specific system model, defining the assumptions about timing and communication that its algorithms can tolerate.

  • Timing Model: Most CFT protocols (like Raft) assume a partially synchronous model, in which periods of asynchrony are bounded. They use timeouts for failure detection.
  • Fault Threshold: A CFT system with N replicas can tolerate f crash faults as long as N > 2f. For example, a 5-node cluster can tolerate 2 simultaneous crashes and still achieve a quorum (3 nodes).
  • Network Assumptions: Messages may be delayed, reordered, or retransmitted, but they are assumed to arrive uncorrupted (corruption is detected by lower-layer protocols like TCP).
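The fault threshold N > 2f can be turned into a short calculation. The helper names `max_crash_faults` and `quorum_size` are hypothetical; the arithmetic follows directly from the inequality above.

```python
def max_crash_faults(n: int) -> int:
    """Largest f satisfying n > 2f, i.e. f = floor((n - 1) / 2)."""
    if n < 1:
        raise ValueError("cluster needs at least one replica")
    return (n - 1) // 2


def quorum_size(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


for n in (3, 4, 5, 7):
    print(n, max_crash_faults(n), quorum_size(n))
# 3 1 2
# 4 1 3
# 5 2 3
# 7 3 4
```

A 4-node cluster tolerates no more crashes than a 3-node cluster while needing a larger quorum, which is one reason odd cluster sizes are the usual choice.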
AGENTIC ROLLBACK STRATEGIES

How Does Crash Fault Tolerance Work?

Crash Fault Tolerance (CFT) is a fundamental property of resilient distributed systems, enabling continued operation despite component failures that manifest as sudden halts.

Crash Fault Tolerance (CFT) is a system's ability to maintain consistency and liveness when components fail by stopping (crashing) without producing malicious outputs. It operates on a fail-stop model, contrasting with the more complex Byzantine Fault Tolerance (BFT). Core mechanisms include state machine replication, where deterministic replicas process identical command sequences, and consensus protocols like Raft or Paxos, which ensure all operational nodes agree on a single state history, enabling seamless failover.

CFT is implemented via leader election to maintain a single coordinating node and log replication to propagate state changes. Upon a leader crash, the protocol elects a new leader whose log contains every committed entry, so the committed history is preserved and clients continue to see a linearizable service. This architecture is foundational for database systems and agentic rollback strategies, where checkpointing provides known-good states for state reversion. CFT assumes non-adversarial, crash-only failures, which makes it simpler and cheaper than BFT but unsuitable for hostile environments.
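To make the election step more concrete, the sketch below models the log comparison Raft uses when granting votes: a voter supports a candidate only if the candidate's log is at least as up to date as its own, comparing the term of the last entry first and its index second. The names `LogPosition`, `grants_vote`, and `wins_election` are hypothetical, and the sketch leaves out other parts of a real election such as term numbers on the request and the one-vote-per-term rule.

```python
from typing import NamedTuple


class LogPosition(NamedTuple):
    last_term: int   # term of the final log entry
    last_index: int  # index of the final log entry


def grants_vote(voter: LogPosition, candidate: LogPosition) -> bool:
    """Vote only if the candidate's log is at least as up to date as the voter's."""
    # Tuple comparison orders by term first, then by index.
    return candidate >= voter


def wins_election(
    candidate: LogPosition, reachable_peers: list[LogPosition], cluster_size: int
) -> bool:
    votes = 1 + sum(grants_vote(peer, candidate) for peer in reachable_peers)  # self-vote
    return votes >= cluster_size // 2 + 1


# Five-node cluster whose leader has crashed; four nodes remain.
fresh = LogPosition(last_term=3, last_index=6)
stale = LogPosition(last_term=2, last_index=9)

print(wins_election(fresh, [LogPosition(3, 6), LogPosition(3, 5), stale], 5))  # True
print(wins_election(stale, [fresh, LogPosition(3, 6), LogPosition(3, 5)], 5))  # False
```

The stale node has a longer log but an older last term, so no up-to-date peer votes for it. Since any committed entry lives on a majority, a candidate that wins a majority under this rule must already hold every committed entry.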

FAULT TOLERANCE MODELS

CFT vs. Byzantine Fault Tolerance (BFT)

A comparison of the two primary fault models in distributed systems, focusing on their assumptions, guarantees, and typical use cases within agentic and resilient software architectures.

| Feature | Crash Fault Tolerance (CFT) | Byzantine Fault Tolerance (BFT) |
| --- | --- | --- |
| Core Fault Assumption | Components fail by stopping (crashing). | Components can fail arbitrarily (maliciously or erroneously). |
| Adversarial Model | Non-adversarial; assumes benign failures. | Adversarial; assumes components may be malicious or buggy. |
| System Model | Synchronous or partially synchronous network. | Usually partially synchronous as well (PBFT, Tendermint, and HotStuff rely on eventual timing bounds for liveness); some protocols assume full synchrony. |
| Consensus Requirements | Requires acknowledgment from a simple majority (> N/2) of all nodes; tolerates f crashes with N ≥ 2f + 1. | Requires agreement from a supermajority (> 2N/3) of all nodes; tolerates f Byzantine nodes with N ≥ 3f + 1. |
| Message Complexity | Lower (e.g., O(N) per decision in Raft). | Higher (e.g., O(N²) in classic BFT protocols). |
| Performance Overhead | Low to moderate. | High, due to cryptographic verification and multiple message rounds. |
| Common Use Cases | Internal datastores (e.g., etcd, ZooKeeper), database replication, agent state coordination. | Blockchain networks, financial settlement systems, secure multi-party computation, defense applications. |
| Resilience to Malicious Actors | No | Yes |
| Typical Protocols | Paxos, Raft, Viewstamped Replication. | Practical Byzantine Fault Tolerance (PBFT), Tendermint, HotStuff. |
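The cost difference between the two models shows up directly in cluster sizing. The sketch below uses the standard bounds for majority-quorum CFT protocols (N ≥ 2f + 1) and PBFT-style BFT protocols (N ≥ 3f + 1); the function names are hypothetical.

```python
def min_replicas_cft(f: int) -> int:
    """Replicas needed to tolerate f crash faults: N >= 2f + 1."""
    return 2 * f + 1


def min_replicas_bft(f: int) -> int:
    """Replicas needed to tolerate f Byzantine faults: N >= 3f + 1."""
    return 3 * f + 1


for f in (1, 2, 3):
    print(f, min_replicas_cft(f), min_replicas_bft(f))
# 1 3 4
# 2 5 7
# 3 7 10
```

Tolerating the same number of faults takes roughly 50% more replicas under the Byzantine model, before counting the extra message rounds and signature checks.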

IMPLEMENTATION PATTERNS

Examples of CFT in Practice

Crash Fault Tolerance is implemented through specific distributed systems patterns and algorithms. These examples illustrate how CFT ensures availability and consistency when components fail by stopping.

04

Primary-Backup Replication

This classic CFT pattern involves a primary (active) node that handles all client requests and one or more backup (standby) nodes that replicate the primary's state. The system tolerates crashes through:

  • State transfer: The primary periodically sends its state (or state diffs) to the backups.
  • Failure detection: Using timeouts or a monitoring service.
  • Failover: A designated backup promotes itself to primary upon detecting the primary's crash.

Challenges include ensuring exactly-once semantics during failover and managing split-brain scenarios. This pattern is common in database systems (e.g., PostgreSQL streaming replication) and high-availability service setups.
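The failover step and the split-brain risk can be sketched with an epoch number that backups use to reject writes from a superseded primary. This is an illustrative model only, not how PostgreSQL or any particular product implements failover: `Backup`, `Primary`, `promote`, and the epoch scheme are hypothetical, replication is synchronous, and the sketch ignores partial failures in which only some backups accept a write.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Standby that accepts state only from the newest primary it knows of."""

    state: dict[str, str] = field(default_factory=dict)
    highest_epoch: int = 0

    def replicate(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale primary: reject to avoid split-brain writes
        self.highest_epoch = epoch
        self.state[key] = value
        return True


@dataclass
class Primary:
    epoch: int
    backups: list[Backup]
    state: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, value: str) -> bool:
        # Acknowledge the client only if every backup accepted the update.
        if not all(b.replicate(self.epoch, key, value) for b in self.backups):
            return False
        self.state[key] = value
        return True


def promote(backup: Backup, others: list[Backup]) -> Primary:
    """Failover: the chosen backup becomes primary under a higher epoch."""
    new_epoch = backup.highest_epoch + 1
    for node in [backup, *others]:
        node.highest_epoch = new_epoch  # announce the epoch to fence the old primary
    return Primary(epoch=new_epoch, backups=others, state=dict(backup.state))


b1, b2 = Backup(), Backup()
old_primary = Primary(epoch=1, backups=[b1, b2])
old_primary.write("order", "created")

# A timeout fires on old_primary; b1 takes over and b2 follows it.
new_primary = promote(b1, others=[b2])
new_primary.write("order", "paid")

# The old primary was only slow, not dead. Its late write is fenced off.
print(old_primary.write("order", "cancelled"))  # False
print(b2.state["order"])                        # paid
```

In practice the epoch itself has to be agreed on reliably, typically through a consensus-backed coordination service or a lease, otherwise two backups could promote themselves under the same epoch.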
CRASH FAULT TOLERANCE

Frequently Asked Questions

Crash Fault Tolerance (CFT) is a fundamental property of reliable distributed systems. This FAQ addresses its core mechanisms, differences from more complex failure models, and its role in modern agentic and self-healing architectures.

Crash Fault Tolerance (CFT) is the property of a distributed system that ensures continued correct operation and data consistency despite the failure of some of its components, under the assumption that failed components simply stop functioning (a 'crash-stop' failure) and do not produce malicious or arbitrary incorrect outputs.

In a CFT system, the primary goal is to maintain liveness (the system continues to make progress) and safety (the system does not return incorrect results) even when nodes become unresponsive. This is achieved through redundancy, consensus protocols, and state replication. CFT is a foundational concept for building reliable databases, message queues, and the coordination layers for autonomous agents that must operate without interruption.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.