Change Data Capture (CDC) is a software design pattern that identifies, captures, and propagates incremental changes (inserts, updates, deletes) made to data in a source system, enabling near real-time synchronization with downstream databases, data warehouses, or streaming platforms. It detects row-level changes either by continuously reading the source database's transaction log (e.g., MySQL's binlog, PostgreSQL's Write-Ahead Log) or by using triggers or timestamp-based queries. These change events are then published as a stream of ordered records, often through a message broker such as Apache Kafka, for consumption by target systems.
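As a rough illustration of what consuming such a stream can look like, the sketch below applies ordered change events to an in-memory replica. The `ChangeEvent` shape, its field names, and the `apply_events` helper are assumptions made for this example; real CDC tools define their own event envelopes, and a real target would be a database or warehouse rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical event shape, not the format of any particular CDC tool.
    seq: int              # position in the ordered stream (e.g. a log offset)
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes


def apply_events(replica: dict, events: list) -> dict:
    """Apply change events to an in-memory replica in stream order."""
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.op in ("insert", "update"):
            replica[ev.key] = ev.row
        elif ev.op == "delete":
            replica.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op}")
    return replica


events = [
    ChangeEvent(1, "insert", 1, {"name": "Ada"}),
    ChangeEvent(2, "insert", 2, {"name": "Bob"}),
    ChangeEvent(3, "update", 1, {"name": "Ada L."}),
    ChangeEvent(4, "delete", 2, None),
]
replica = apply_events({}, events)
print(replica)  # {1: {'name': 'Ada L.'}}
```

Sorting by `seq` stands in for the ordering guarantee of the stream: applying the update before the insert, or the delete before the insert, would leave the replica in a different state from the source.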
Key mechanisms include:
- Log-Based CDC: The most robust method, reading the database's native transaction log to capture all changes with minimal performance impact on the source.
- Trigger-Based CDC: Uses database triggers to write changes to a separate shadow table, adding overhead to the source's write operations.
- Query-Based (Timestamp/Incrementing Key) CDC: Polls a table for rows modified after a last-known timestamp or key value; simpler, but it cannot capture hard deletes (a deleted row no longer matches any query) and it misses intermediate states when a row changes more than once between polls.
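The difference between the trigger-based and query-based mechanisms can be sketched with Python's built-in `sqlite3` module. The table names (`customers`, `customers_changes`), the `version` column, and the `poll` helper are all hypothetical, chosen only for this example. Log-based CDC is not shown, because it depends on the log format and tooling of the specific database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    version INTEGER NOT NULL      -- incrementing key, bumped on every write
);
CREATE TABLE customers_changes (  -- shadow table filled by the triggers
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,
    row_id INTEGER NOT NULL,
    name TEXT
);
CREATE TRIGGER customers_ai AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, row_id, name)
    VALUES ('insert', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_au AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (op, row_id, name)
    VALUES ('update', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_ad AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, row_id, name)
    VALUES ('delete', OLD.id, NULL);
END;
""")


def poll(conn, last_version):
    """Query-based CDC: fetch rows whose version exceeds the last one seen."""
    return conn.execute(
        "SELECT id, name, version FROM customers "
        "WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()


conn.execute("INSERT INTO customers VALUES (1, 'Ada', 1)")
conn.execute("INSERT INTO customers VALUES (2, 'Bob', 2)")
first_poll = poll(conn, 0)                       # sees both inserts
last_version = max(v for _, _, v in first_poll)

# Two updates and a hard delete happen before the next poll.
conn.execute("UPDATE customers SET name = 'Ada L.', version = 3 WHERE id = 1")
conn.execute("UPDATE customers SET name = 'Ada Lovelace', version = 4 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

second_poll = poll(conn, last_version)
trigger_log = conn.execute(
    "SELECT op, row_id, name FROM customers_changes ORDER BY change_id"
).fetchall()

print(second_poll)   # [(1, 'Ada Lovelace', 4)]
print(trigger_log)
```

In this sketch the second poll returns only the final state of row 1: the intermediate value `'Ada L.'` and the deletion of row 2 are both invisible to it. The shadow table records all five changes, at the cost of one extra insert per write on the source.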
CDC is foundational for building event-driven architectures and keeping analytics data fresh, and it is a core component of data replication, cache invalidation, and microservice synchronization.