Inferensys

Watermark

A watermark is a timestamp-based mechanism in stream processing that estimates the progress of event time, signaling when the system believes all data up to a certain point has been received to enable deterministic windowed computations.
STREAM PROCESSING

What is a Watermark?

In stream processing, a watermark is a timestamp-based mechanism that estimates the progress of event time, signaling when the system believes all data up to a certain point in time has been received, enabling windowed computations to complete.

A watermark is a timestamp-based progress indicator in event-time stream processing. It represents the system's belief that all events with timestamps earlier than the watermark's value have been observed. This allows the processing engine to safely trigger computations, like window aggregations, knowing that no further late-arriving data for that time period is expected. Watermarks are essential for handling out-of-order data in systems like Apache Flink or Apache Beam.

Watermarks are generated from observed event timestamps, usually trailing the highest timestamp seen by a configured out-of-orderness bound that accommodates expected disorder. They flow through the pipeline alongside the data stream. When a watermark advances past the end time of a time window, the system can emit the window's result. Events that arrive after that point are late; some engines can hold window state for an additional allowed-lateness period to update the result, while others drop or redirect such events. This mechanism trades result completeness against latency and makes event-time analytics practical in asynchronous, distributed environments.

Key Characteristics of Watermarks

01

Event Time vs. Processing Time

Watermarks are fundamentally tied to event time, the time at which an event actually occurred in the real world, as opposed to processing time, the time at which the system observes the event. Since events can arrive late or out of order, the system cannot wait indefinitely for stragglers. A watermark is a heuristic that estimates completeness, allowing the system to reason about event-time progress and trigger computations like window aggregation. For example, a watermark of 12:05 means the system believes all events with timestamps up to 12:05 have been ingested.

02

Heuristic Nature & Late Data

A watermark is a heuristic estimate, not a guarantee. It is derived from observed event timestamps and assumptions about how far out of order the data can arrive. A common strategy takes the maximum event timestamp seen so far and subtracts a configurable out-of-orderness bound (e.g., 10 seconds). This creates a trade-off: a smaller bound reduces output latency but classifies more events as late, while a larger bound increases latency but improves result completeness. Late data that arrives after the watermark has passed its window may be accepted within a separate allowed-lateness period, routed to a side output, or dropped, depending on the system's configuration.
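
The "maximum timestamp minus a bound" strategy can be sketched in a few lines of plain Python. This is an illustrative model, not the API of any particular engine: the class name, method names, and the millisecond integer timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundedOutOfOrdernessWatermarks:
    """Hypothetical generator: watermark = max event timestamp seen - bound."""

    max_out_of_orderness_ms: int
    max_timestamp_ms: Optional[int] = None

    def on_event(self, event_timestamp_ms: int) -> None:
        # Only a new maximum can move the watermark forward.
        if self.max_timestamp_ms is None or event_timestamp_ms > self.max_timestamp_ms:
            self.max_timestamp_ms = event_timestamp_ms

    def current_watermark(self) -> Optional[int]:
        # No watermark exists until at least one event has been observed.
        if self.max_timestamp_ms is None:
            return None
        return self.max_timestamp_ms - self.max_out_of_orderness_ms

    def is_late(self, event_timestamp_ms: int) -> bool:
        # An event is late if the watermark has already reached its timestamp.
        watermark = self.current_watermark()
        return watermark is not None and event_timestamp_ms <= watermark


generator = BoundedOutOfOrdernessWatermarks(max_out_of_orderness_ms=10_000)
late_flags = []
for ts in [1_000, 15_000, 12_000, 4_000]:
    # Check lateness against the watermark as it stood before this event.
    late_flags.append(generator.is_late(ts))
    generator.on_event(ts)

print(late_flags)
print(generator.current_watermark())
```

In this run the event at 12,000 ms is out of order but still inside the 10-second bound, while the event at 4,000 ms falls behind the watermark of 5,000 ms and is flagged as late. Shrinking the bound would flag more events; growing it would delay every window result.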

03

Window Triggering Mechanism

The primary function of a watermark is to trigger the evaluation of event-time windows. A window (e.g., a 5-minute tumbling window) is considered ready for computation when the watermark advances past the window's end time. For a window covering 12:00 up to, but not including, 12:05, the system closes the window and emits a result once the watermark reaches the end of that range. Because results depend on when events happened rather than when they were processed, reprocessing the same input yields the same windows, which is critical for accurate time-series analytics over unbounded data streams. Events that arrive behind the watermark are the exception, and how they are handled depends on the late-data policy.
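
A small sketch can make the trigger rule concrete. The counter below is a hypothetical, single-threaded model: it assigns events to 5-minute tumbling windows and fires a window once the watermark is at or beyond the window's end. The class and method names are invented for illustration, timestamps are milliseconds relative to 12:00, and late events are simply rejected, which is only one of several possible policies.

```python
from collections import defaultdict

MINUTE_MS = 60_000
WINDOW_SIZE_MS = 5 * MINUTE_MS  # 5-minute tumbling windows


class TumblingWindowCounter:
    """Hypothetical event counter that fires windows when the watermark passes them."""

    def __init__(self, size_ms=WINDOW_SIZE_MS):
        self.size_ms = size_ms
        self.open_windows = defaultdict(int)  # window start -> event count
        self.watermark_ms = None

    def on_event(self, ts_ms):
        start = ts_ms - (ts_ms % self.size_ms)
        end = start + self.size_ms
        # If the watermark already covers this window, the event is late.
        if self.watermark_ms is not None and end <= self.watermark_ms:
            return False  # dropped in this sketch
        self.open_windows[start] += 1
        return True

    def on_watermark(self, watermark_ms):
        # Watermarks never move backwards.
        if self.watermark_ms is None or watermark_ms > self.watermark_ms:
            self.watermark_ms = watermark_ms
        fired = []
        for start in sorted(self.open_windows):
            end = start + self.size_ms
            if end <= self.watermark_ms:
                fired.append((start, end, self.open_windows.pop(start)))
        return fired


counter = TumblingWindowCounter()
counter.on_event(1 * MINUTE_MS)   # 12:01 -> window starting at 12:00
counter.on_event(4 * MINUTE_MS)   # 12:04 -> window starting at 12:00
counter.on_event(6 * MINUTE_MS)   # 12:06 -> window starting at 12:05

fired_early = counter.on_watermark(4 * MINUTE_MS)    # watermark 12:04
fired_first = counter.on_watermark(5 * MINUTE_MS)    # watermark 12:05
accepted_late = counter.on_event(2 * MINUTE_MS)      # 12:02 arrives too late
fired_second = counter.on_watermark(10 * MINUTE_MS)  # watermark 12:10

print(fired_early, fired_first, accepted_late, fired_second)
```

The first window produces no output at watermark 12:04, fires with a count of 2 at watermark 12:05, and the straggler stamped 12:02 is rejected afterwards. Real engines differ in the exact boundary condition and in how they treat such stragglers.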

04

Sources of Watermark Generation

Watermarks can be generated at different points in the pipeline:

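  • Source-Generated: Often the most accurate method. The source connector that reads from a system such as Apache Kafka or Google Cloud Pub/Sub generates watermarks from its knowledge of per-partition or backlog progress. This is common with log-based systems, where timestamps within a partition tend to be close to ordered.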
  • Operator-Generated: The stream processing engine (like Apache Flink or Apache Beam) generates watermarks internally by observing timestamps from all upstream sources and propagating the minimum watermark across parallel streams. This ensures downstream operators have a consistent view of time progress.
  • Punctuated vs. Periodic: Watermarks can be emitted in response to specific events (punctuated) or at regular time intervals (periodic).
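
The minimum-watermark rule used when an operator combines parallel inputs can be sketched as follows. The function name and the partition labels are hypothetical, and returning `None` until every input has reported is an assumption of this sketch rather than documented behavior of a specific engine.

```python
def combined_watermark(input_watermarks):
    """Return the minimum watermark across inputs, or None if any input is unknown."""
    if not input_watermarks:
        return None
    if any(wm is None for wm in input_watermarks.values()):
        return None
    # The slowest input bounds how far event time has provably progressed.
    return min(input_watermarks.values())


partitions = {"partition-0": 120_000, "partition-1": 95_000, "partition-2": 110_000}
before = combined_watermark(partitions)

# Only when the slowest partition advances does the combined watermark move.
partitions["partition-1"] = 130_000
after = combined_watermark(partitions)

print(before, after)
```

Taking the minimum is the conservative choice: the operator never claims more progress than its slowest input can support.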

05

Idleness Handling

A critical challenge occurs when a data source or partition becomes idle (stops sending data). If one partition in a parallel stream is idle, its watermark stops advancing. Since the system-wide watermark is often the minimum across all parallel inputs, a single idle source can stall the entire pipeline's time progress, preventing window triggers. Modern systems like Apache Flink have idleness detection to temporarily exclude idle sources from the watermark calculation, allowing time to advance based on active sources. This prevents pipeline stalls due to sporadic data sources.
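
The stall and its remedy can be modeled with a small tracker that ignores inputs which have not reported within a timeout. This is a simplified sketch under stated assumptions: the class name, the `idle_timeout_ms` parameter, and the explicit `now_ms` processing-time argument are all hypothetical, and real engines implement idleness detection with their own rules.

```python
class IdleAwareWatermarkTracker:
    """Hypothetical tracker: minimum watermark over inputs that are not idle."""

    def __init__(self, idle_timeout_ms):
        self.idle_timeout_ms = idle_timeout_ms
        self.watermarks = {}     # input id -> latest watermark
        self.last_activity = {}  # input id -> processing time of last update
        self.emitted = None      # last watermark sent downstream

    def update(self, input_id, watermark_ms, now_ms):
        previous = self.watermarks.get(input_id)
        if previous is None or watermark_ms > previous:
            self.watermarks[input_id] = watermark_ms
        self.last_activity[input_id] = now_ms

    def current(self, now_ms):
        # Inputs silent for at least the timeout are excluded from the minimum.
        active = [
            wm
            for input_id, wm in self.watermarks.items()
            if now_ms - self.last_activity[input_id] < self.idle_timeout_ms
        ]
        if active:
            candidate = min(active)
            # The emitted watermark must never move backwards.
            if self.emitted is None or candidate > self.emitted:
                self.emitted = candidate
        return self.emitted


tracker = IdleAwareWatermarkTracker(idle_timeout_ms=30_000)
tracker.update("a", 100_000, now_ms=0)
tracker.update("b", 90_000, now_ms=0)
both_active = tracker.current(now_ms=1_000)

# Input "b" goes silent while "a" keeps advancing.
tracker.update("a", 150_000, now_ms=40_000)
b_idle = tracker.current(now_ms=40_000)

# "b" resumes with older data; the emitted watermark does not regress.
tracker.update("b", 95_000, now_ms=41_000)
b_resumed = tracker.current(now_ms=41_000)

print(both_active, b_idle, b_resumed)
```

Without the idle check, the combined watermark would stay at 90,000 for as long as input "b" is silent. The cost of excluding idle inputs is that data arriving later from a resumed input may already be behind the watermark and therefore late.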

06

Relation to Exactly-Once Processing

Watermarks and exactly-once processing are related but separate mechanisms in stateful stream processors. Exactly-once state semantics come from fault tolerance: the system periodically takes a consistent snapshot (checkpoint) of operator state and, after a failure, restores it and replays input from the matching source positions. In Apache Flink, for example, checkpoints are coordinated by checkpoint barriers that flow through the stream, not by watermarks. Watermarks contribute the event-time side: window contents and pending event-time timers are held in the state backend (like RocksDB) and included in each checkpoint, so after recovery the windows fire according to event time rather than arrival order. Combined with replayable sources and transactional or idempotent sinks, this lets the system recover from failures and produce consistent, non-duplicated results.

WATERMARK

Frequently Asked Questions

A watermark is a timestamp-based mechanism in stream processing that estimates the progress of event time, signaling when the system believes all data up to a certain point in time has been received. It is a critical concept for handling out-of-order data in systems like Apache Flink, Apache Beam, and Google Cloud Dataflow. Watermarks allow the system to reason about completeness, enabling operations like windowed aggregations to trigger and produce results without waiting indefinitely for late-arriving data. They represent a trade-off between latency (waiting for data) and completeness (accuracy of the result).

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.