Glossary

Apache Iceberg

Apache Iceberg is an open-source table format for managing large, slowly changing datasets in data lakes, providing ACID transactions, hidden partitioning, and schema evolution.

OPEN TABLE FORMAT

What is Apache Iceberg?

Apache Iceberg is a high-performance, open-source table format designed for managing massive analytic datasets in data lakes, providing data warehouse-like reliability on scalable object storage.

Apache Iceberg is an open-source table format for managing large, slowly changing datasets in data lakes, providing ACID transactions, hidden partitioning, and schema evolution to address the reliability and performance limitations of raw object storage. It functions as a specification layer that organizes files into tables with consistent metadata, enabling engines like Apache Spark, Trino, and Apache Flink to interact with data as if it were in a traditional warehouse.

Its architecture separates the physical data layout from the logical table view, enabling key enterprise features. Time travel allows querying historical snapshots, partition evolution lets you change partition schemes without rewriting data, and snapshot isolation ensures concurrent readers and writers do not conflict. This makes Iceberg a foundational component of the modern data lakehouse architecture, bridging data lakes and warehouses.

TABLE FORMAT ARCHITECTURE

Key Features of Apache Iceberg

Apache Iceberg is an open-source table format that brings data warehouse-like reliability and performance to data lakes. Its core features address the fundamental limitations of managing large-scale analytical data on object storage.

ACID Transactions

Apache Iceberg provides ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees on object storage. This ensures data integrity by making concurrent writes safe and preventing readers from seeing partial, uncommitted data.

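Atomic commits: Changes to data files and metadata become visible in a single step, when the catalog's pointer to the current metadata file is swapped.
Serializable isolation: Writers use optimistic concurrency. A writer whose base snapshot is out of date at commit time must retry against the new table state, so conflicting table states are never committed.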
Consistent reads: Readers always see a complete, consistent snapshot of the table, even during writes.
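The commit protocol behind these guarantees can be sketched as a compare-and-swap on the catalog's pointer to the current metadata file. The following is a toy in-memory model rather than Iceberg's API: ToyCatalog, CommitConflict, and the metadata file names are hypothetical, and a lock stands in for whatever atomic operation a real catalog provides.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Toy stand-in for a catalog: one pointer to the current metadata file."""

    def __init__(self, initial_metadata: str) -> None:
        self._current = initial_metadata
        self._lock = threading.Lock()  # models the catalog's atomic swap

    def current(self) -> str:
        return self._current

    def commit(self, expected: str, new: str) -> None:
        # Compare-and-swap: succeed only if nobody has committed
        # since `expected` was read.
        with self._lock:
            if self._current != expected:
                raise CommitConflict(f"expected {expected}, found {self._current}")
            self._current = new


catalog = ToyCatalog("v1.metadata.json")

# Two writers start from the same base version.
base_a = catalog.current()
base_b = catalog.current()

catalog.commit(base_a, "v2-writer-a.metadata.json")  # writer A wins the race

try:
    catalog.commit(base_b, "v2-writer-b.metadata.json")  # writer B is stale
    conflict = False
except CommitConflict:
    conflict = True
    # Writer B re-reads the table state and retries on top of A's commit.
    catalog.commit(catalog.current(), "v3-writer-b.metadata.json")
```

Readers that loaded v1 or v2 before the swap keep reading those versions, which is why they never observe a half-finished write.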

Hidden Partitioning & Schema Evolution

Iceberg decouples the physical layout of data from its logical representation, enabling powerful evolution capabilities without breaking queries.

Hidden partitioning: Queries filter on table data (e.g., WHERE event_date = '2024-01-01'), not directory paths. The table's partitioning scheme can be changed without requiring SQL queries to be rewritten.
Safe schema evolution: Columns can be added, dropped, renamed, or widened to a larger type (e.g., INT to BIGINT) in a backward-compatible way. Existing data files remain valid because columns are tracked by unique ID rather than by name or position, and readers using older schemas continue to work.

Time Travel & Rollback

Iceberg retains a history of table snapshots until they are expired, enabling deterministic data auditing and recovery.

Time travel: Query the table's state as it existed at a specific snapshot ID or point in time. In Spark SQL, for example: SELECT * FROM table VERSION AS OF 1234 or ... TIMESTAMP AS OF '2024-01-01 10:00:00'.
Rollback: Revert the entire table to a previous, known-good snapshot as a metadata-only operation, without rewriting data files. This is critical for correcting erroneous batch jobs or recovering from data corruption.
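A rough model of how both operations resolve against the snapshot log is shown below. It is a sketch under simplifying assumptions: ToyTable and Snapshot are hypothetical names, each snapshot simply lists its data files, and timestamps are plain integers in milliseconds.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: tuple[str, ...]


class ToyTable:
    def __init__(self) -> None:
        self.snapshots: list[Snapshot] = []  # ordered by commit time
        self.current_id: Optional[int] = None

    def commit(self, snapshot: Snapshot) -> None:
        self.snapshots.append(snapshot)
        self.current_id = snapshot.snapshot_id

    def as_of_version(self, snapshot_id: int) -> Snapshot:
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise KeyError(f"unknown snapshot {snapshot_id}")

    def as_of_timestamp(self, timestamp_ms: int) -> Snapshot:
        # Latest snapshot committed at or before the requested time.
        times = [s.timestamp_ms for s in self.snapshots]
        idx = bisect_right(times, timestamp_ms)
        if idx == 0:
            raise ValueError("no snapshot exists at or before that time")
        return self.snapshots[idx - 1]

    def rollback_to(self, snapshot_id: int) -> None:
        # Metadata-only: move the current pointer; no data files are touched.
        self.as_of_version(snapshot_id)  # validates that the snapshot exists
        self.current_id = snapshot_id


table = ToyTable()
table.commit(Snapshot(1, 1_000, ("a.parquet",)))
table.commit(Snapshot(2, 2_000, ("a.parquet", "b.parquet")))
table.commit(Snapshot(3, 3_000, ("a.parquet", "b.parquet", "bad.parquet")))

files_then = table.as_of_timestamp(2_500).data_files  # state as of snapshot 2
table.rollback_to(2)  # undo the bad commit
```

After the rollback, snapshot 3 is still in the log, so the bad commit remains available for inspection until it is expired.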

Performance Optimizations

The format includes several architectural features designed for high-performance analytics on petabyte-scale datasets.

Advanced metadata: Iceberg maintains rich metadata files (manifest lists and manifests) that catalog every data file, its partition values, and column-level statistics (min/max, null counts).
Metadata pruning: Query engines use this metadata to skip entire files and partitions that cannot contain relevant data, drastically reducing I/O.
Data file pruning: Within the files that remain, the file format's own statistics (row groups in Parquet, stripes in ORC) let engines skip further blocks of rows.
Partition evolution: Partition schemes can be updated to optimize for new query patterns without a costly full data rewrite.
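Metadata pruning comes down to a range-overlap test between a query predicate and each file's min/max statistics. The sketch below illustrates the idea with hypothetical names (FileStats, may_contain, plan_scan) and made-up file paths; it does not reproduce Iceberg's manifest layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    lower: dict  # column name -> minimum value in the file
    upper: dict  # column name -> maximum value in the file


def may_contain(stats: FileStats, column: str, lo, hi) -> bool:
    """True if the file's [min, max] range for `column` overlaps [lo, hi]."""
    return stats.lower[column] <= hi and stats.upper[column] >= lo


def plan_scan(files, column, lo, hi):
    """Keep only files whose statistics cannot rule out a match."""
    return [f.path for f in files if may_contain(f, column, lo, hi)]


manifest = [
    FileStats("f1.parquet", {"user_id": 1}, {"user_id": 100}),
    FileStats("f2.parquet", {"user_id": 101}, {"user_id": 200}),
    FileStats("f3.parquet", {"user_id": 201}, {"user_id": 300}),
]

# Equivalent to: WHERE user_id BETWEEN 150 AND 210
selected = plan_scan(manifest, "user_id", 150, 210)
```

Here f1.parquet is never opened, because its maximum user_id (100) is below the lower bound of the predicate.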

Format & Engine Agnosticism

Iceberg is designed as an open standard, independent of specific execution engines or underlying file formats.

Multiple engine support: Tables can be created and queried by Apache Spark, Trino, Flink, Apache Hive, and many other compute engines.
File format flexibility: Underlying data files are typically stored in efficient columnar formats like Apache Parquet, but Iceberg itself is format-agnostic.
Object store native: It is optimized for cloud object stores (S3, ADLS, GCS) but works on HDFS and other systems.

Data Lakehouse Foundation

Iceberg is a foundational component of the data lakehouse architecture, merging the best aspects of data lakes and warehouses.

Combines strengths: It provides the low-cost, flexible storage of a data lake with the ACID compliance, schema enforcement, and performance of a data warehouse.
Unified tier: Serves as a single, reliable source of truth for both batch and streaming data, supporting BI, SQL analytics, and machine learning workloads.
Governance-ready: Its immutable snapshot log and rich metadata provide a strong foundation for data lineage, auditability, and governance.

TABLE FORMAT

How Apache Iceberg Works

Apache Iceberg is an open-source table format that structures massive datasets stored in object storage like Amazon S3 or Azure Data Lake Storage, providing a reliable, high-performance abstraction layer for analytical engines.

Apache Iceberg functions as a metadata layer that sits atop files in a data lake, defining each table through a table metadata file, a manifest list per snapshot, manifest files, and data files. This architecture enables ACID transactions and time travel by tracking snapshots of the table's state. Every write, such as an insert or delete, produces a new snapshot that reuses the unchanged data files of its predecessor instead of rewriting the whole table, ensuring isolation and consistency for concurrent readers and writers. The format's core innovation is decoupling the physical data layout from the logical table that queries see.

It provides hidden partitioning and schema evolution, allowing engines to filter data efficiently without directory-based partition discovery and to safely add, rename, or delete columns. Metadata pruning and statistics (like min/max values) at the file level enable fast query planning. Iceberg's design directly addresses the limitations of raw object storage, transforming it into a managed, query-optimized data lakehouse foundation compatible with engines like Spark, Trino, and Flink.

OPEN TABLE FORMAT COMPARISON

Apache Iceberg vs. Delta Lake vs. Hudi

A technical comparison of the three leading open-source table formats designed to bring data warehouse-like reliability and performance to data lakes.

Feature / Capability	Apache Iceberg	Delta Lake	Apache Hudi
Primary Maintainer / Origin	Apache Software Foundation (originated at Netflix)	Linux Foundation (originated at Databricks)	Apache Software Foundation (originated at Uber)
Core Storage Abstraction	Table format with separate metadata, data, and manifest files.	Transaction log (JSON/Parquet) stored alongside data files.	Timeline of actions stored in `.hoodie` directory with data files.
ACID Transaction Guarantees	Yes	Yes	Yes
Hidden Partitioning
Schema Evolution	Add, drop, rename, update, reorder columns.	Add, drop, rename columns (update type with limitations).	Add, drop, rename columns.
Time Travel / Data Versioning	Snapshot-based via manifest lists. Supports branch/tag.	Versioned via transaction log. Direct timestamp/version query.	Snapshot-based via commit timeline. Incremental query support.
Partition Evolution
Data File Format Agnostic	Primarily Parquet, Avro, ORC.	Primarily Parquet.	Primarily Parquet, Avro.
Primary Use Case Focus	Large-scale analytic tables with complex schemas and queries.	Reliable data engineering pipelines and streaming/batch unification.	Fast upserts/change data capture and incremental processing.
Streaming & Batch Unification	Yes	Yes	Yes
Compute Engine Integration	Apache Spark, Trino, Flink, Presto, Hive, Dremio, Snowflake, etc.	Apache Spark, Databricks Runtime, Flink, Presto, Trino, etc.	Apache Spark, Flink, Hive, Presto, Trino, etc.
Performance Optimizations	Advanced planning via manifest files, partition pruning, column stats.	Data skipping via statistics in transaction log, Z-Ordering.	Indexing for upserts (Bloom, HBase, Simple), clustering.

ENTERPRISE DATA ARCHITECTURE

Common Use Cases for Apache Iceberg

Apache Iceberg is a high-performance table format for managing massive analytic datasets in data lakes. Its core features—ACID transactions, hidden partitioning, and schema evolution—solve critical reliability and performance problems inherent to raw object storage.

Time Travel & Data Rollback

Iceberg provides snapshot isolation and immutable snapshots, enabling deterministic time travel queries. This allows users to query data as it existed at a specific point in time, which is critical for:

Reproducible analytics: Auditing and debugging by re-running reports on historical data.
Accidental deletion recovery: Rolling back to a previous snapshot to undo a bad DELETE or UPDATE operation.
Compliance: Meeting regulatory requirements for data retention and audit trails by keeping a history of changes for as long as snapshots are retained.

Schema Evolution & Safe Migration

Iceberg supports in-place, non-breaking schema evolution, allowing table schemas to be updated without rewriting data or breaking existing queries. Key operations include:

Adding columns: New columns can be added and populated without affecting existing reads.
Renaming columns: Columns can be renamed while preserving existing data; Iceberg manages the mapping.
Evolving types: Certain type changes (e.g., int to long) are supported safely.
Nested field evolution: Adding, removing, or renaming fields within complex structs, maps, and arrays.

This eliminates costly, error-prone data migration pipelines and enables agile data product development.
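Renames and additions are safe because Iceberg identifies columns by a unique field ID instead of by name. The toy example below shows how a reader can project old and new data files onto the latest schema; the dictionaries are a deliberately simplified, hypothetical representation of schemas and stored rows.

```python
# Each data file records values by field ID, not by column name.
old_file_rows = [{1: 42, 2: "click"}]          # written under schema v1
new_file_rows = [{1: 43, 2: "view", 3: "US"}]  # written after adding field 3

schema_v1 = {1: "id", 2: "event"}
# v2 renames field 2 and adds field 3; IDs of existing fields never change.
schema_v2 = {1: "id", 2: "event_type", 3: "country"}


def read(rows, schema):
    """Project stored rows onto `schema`; fields missing from a file read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


result = read(old_file_rows + new_file_rows, schema_v2)
```

The old file is never rewritten: its value for field 2 simply shows up under the new name, and the added column reads as None for rows written before it existed.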

Hidden Partitioning & Partition Evolution

Iceberg implements hidden partitioning, where the physical layout is decoupled from the logical table schema. This solves major pain points of Hive-style partitioning:

No directory-based filters: Users query by column (e.g., WHERE event_date = '2024-01-01'), not by path. Iceberg automatically applies partition transforms.
Partition evolution: The partition scheme of a table can be changed (e.g., from DAY(event_ts) to MONTH(event_ts)) without requiring existing data to be rewritten. New data uses the new scheme while old data remains queryable.
Multiple partition transforms: Supports identity, bucket, truncate, year, month, day, and hour transforms on columns.
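To make the transforms concrete, here is a small sketch of three of them. It assumes the definitions in the Iceberg specification (day and month count from the Unix epoch, truncate rounds an integer down to a multiple of the width); the function names and the files_by_day mapping are illustrative only, and bucket is omitted because it depends on a specific hash function.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day(ts: datetime) -> int:
    """Days since 1970-01-01."""
    return (ts - EPOCH).days


def month(ts: datetime) -> int:
    """Months since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def truncate(width: int, value: int) -> int:
    """Round an integer down to a multiple of `width`."""
    return value - (value % width)


event_ts = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)
partition_day = day(event_ts)
partition_month = month(event_ts)

# A filter on event_ts is translated into a partition value by the engine,
# so the query never mentions the partition layout.
files_by_day = {19722: ["d0.parquet"], 19723: ["d1.parquet"], 19724: ["d2.parquet"]}
matching = files_by_day[day(event_ts)]
```

Because the transform is stored in table metadata, the writer derives the partition value from event_ts automatically and the reader applies the same function to the predicate.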

Incremental Processing & Change Data Capture (CDC)

Iceberg's snapshot model enables efficient incremental processing. By tracking snapshots, systems can identify precisely what data has changed between two points in time.

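Incremental reads: Engines can read only the data added between two snapshots. In Spark, for example, the start-snapshot-id and end-snapshot-id read options select appended data, and the create_changelog_view procedure exposes row-level inserts, updates, and deletes.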
Downstream pipeline optimization: Downstream ETL, materialized views, or feature stores only process new data, reducing compute costs and latency.
Upserts for CDC: Apply changes from operational databases with MERGE INTO statements, which update matching rows and insert new ones. In merge-on-read mode, row-level changes are written as delete files and reconciled at read time rather than by rewriting whole data files.
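The core of incremental processing is working out which files differ between two snapshots. The sketch below does this by comparing file sets, which is a simplification: real manifests record an added or deleted status per file, so engines do not need to diff full listings. The snapshot IDs, paths, and function names are hypothetical.

```python
def added_files(snapshots: dict, start_id: int, end_id: int) -> set:
    """Files present in snapshot `end_id` that were not in snapshot `start_id`."""
    return snapshots[end_id] - snapshots[start_id]


def removed_files(snapshots: dict, start_id: int, end_id: int) -> set:
    """Files present in snapshot `start_id` that are gone by snapshot `end_id`."""
    return snapshots[start_id] - snapshots[end_id]


# Snapshot ID -> set of live data files in that snapshot.
snapshots = {
    10: {"a.parquet"},
    11: {"a.parquet", "b.parquet"},
    12: {"b.parquet", "c.parquet"},  # a.parquet was dropped in this commit
}

new_since_10 = sorted(added_files(snapshots, 10, 12))
gone_since_10 = sorted(removed_files(snapshots, 10, 12))
```

A downstream job that last processed snapshot 10 only needs to read the files in new_since_10 and account for the rows that disappeared with gone_since_10.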

Performance Optimization with Data Skipping

Iceberg maintains rich metadata—including manifest files with column-level statistics (min/max values, null counts)—enabling highly efficient data skipping during query planning.

Metadata filtering: Query engines (Spark, Trino, Flink) read the metadata first to prune files that cannot contain relevant data, drastically reducing I/O.
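Compaction: Small files created by streaming writes can be compacted into larger files with the rewrite_data_files procedure to maintain read performance. Open-source Iceberg does not run this on its own; it has to be scheduled as a maintenance job unless the platform provides managed compaction.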
Sorting and Z-ordering: Data can be physically ordered within files using Z-order on multiple columns (e.g., user_id, event_date), co-locating related data and maximizing the effectiveness of data skipping.
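Compaction planning is essentially a bin-packing problem: group small files so that each rewritten output lands near a target size. The following is a greedy first-fit-decreasing sketch of that idea, not the algorithm rewrite_data_files actually uses; plan_compaction, the file names, and the 128 MB target are assumptions made for illustration.

```python
def plan_compaction(file_sizes: dict, target_bytes: int) -> list:
    """Group files smaller than the target into bins of at most `target_bytes`."""
    bins = []  # each bin is [total_size, [paths]]
    small = {p: s for p, s in file_sizes.items() if s < target_bytes}
    # Place the largest small files first, each into the first bin with room.
    for path, size in sorted(small.items(), key=lambda kv: kv[1], reverse=True):
        for b in bins:
            if b[0] + size <= target_bytes:
                b[0] += size
                b[1].append(path)
                break
        else:
            bins.append([size, [path]])
    # Rewriting a single file on its own gains nothing; keep multi-file groups.
    return [paths for _, paths in bins if len(paths) > 1]


MB = 1024 * 1024
sizes = {
    "s1.parquet": 60 * MB,
    "s2.parquet": 50 * MB,
    "s3.parquet": 40 * MB,
    "s4.parquet": 30 * MB,
    "big.parquet": 512 * MB,  # already large enough, left alone
}
groups = plan_compaction(sizes, target_bytes=128 * MB)
```

Each group would be rewritten as one larger file and committed as a new snapshot, so readers see either the old small files or the compacted file, never a mix.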

Multi-Modal Data Lake Foundation

Iceberg serves as a unified table layer for heterogeneous data workloads, making it a cornerstone for data lakehouse architectures.

Unified Batch & Streaming: Serves as both the source and sink for batch (Spark) and streaming (Flink, Kafka Connect) jobs with full ACID guarantees.
Multi-engine consistency: Tables can be written by Spark and then queried by Trino, Presto, or Snowflake without inconsistency, provided the engines share the same catalog, thanks to a standardized, open metadata format.
Foundation for ML/Feature Stores: Provides reliable, versioned, and efficiently queryable storage for feature data, acting as a robust backend for feature stores. Its time-travel capability is essential for point-in-time correctness in model training.

APACHE ICEBERG

Frequently Asked Questions

Apache Iceberg is a foundational technology for modern data lakehouses. These questions address its core mechanics, benefits, and how it compares to related technologies.

Apache Iceberg is an open-source, high-performance table format for managing massive analytic tables on scalable object storage like Amazon S3 or Azure Blob Storage. It works by adding a structured metadata layer on top of raw data files (e.g., Parquet, ORC) that tracks the table's complete state, enabling features like ACID transactions, time travel, and schema evolution without locking data. The architecture consists of:

Metadata Files: The table metadata file records the schema, partition specs, and the list of snapshots, including the current one. The catalog points to the latest metadata file.
Manifest Lists: Files that list manifests for a given snapshot.
Manifest Files: Files that list data files with partition and column-level statistics.
Data Files: The actual Parquet/ORC/Avro files containing the table's data.

When a query runs, the engine reads the metadata to precisely identify which data files are relevant, enabling efficient partition pruning and file skipping even for complex queries.
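The lookup path through these layers can be sketched as a few dictionary hops. Everything below is a simplified stand-in: the paths, the table name db.events, and the dictionary shapes are hypothetical, and real metadata and manifest files carry many more fields.

```python
# Catalog -> table metadata file -> manifest list -> manifests -> data files
catalog = {"db.events": "metadata/v2.metadata.json"}

storage = {
    "metadata/v2.metadata.json": {
        "current-snapshot-id": 2,
        "snapshots": {1: "metadata/snap-1.avro", 2: "metadata/snap-2.avro"},
    },
    # Manifest lists: one per snapshot. Snapshot 2 reuses manifest m1.
    "metadata/snap-1.avro": ["metadata/m1.avro"],
    "metadata/snap-2.avro": ["metadata/m1.avro", "metadata/m2.avro"],
    # Manifests: each lists data files.
    "metadata/m1.avro": ["data/a.parquet", "data/b.parquet"],
    "metadata/m2.avro": ["data/c.parquet"],
}


def data_files(table: str, snapshot_id=None) -> list:
    """Resolve the data files of a snapshot (the current one by default)."""
    metadata = storage[catalog[table]]
    sid = metadata["current-snapshot-id"] if snapshot_id is None else snapshot_id
    manifest_list = storage[metadata["snapshots"][sid]]
    return [path for manifest in manifest_list for path in storage[manifest]]


current_files = data_files("db.events")
previous_files = data_files("db.events", snapshot_id=1)
```

Note that the second snapshot adds one manifest and reuses the first, which is how a commit avoids rewriting metadata for files it did not change.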

MULTIMODAL DATA STORAGE

Related Terms

Apache Iceberg is a foundational component of modern data architectures. Understanding its relationship to these adjacent technologies is crucial for designing scalable, reliable data platforms.

Data Lakehouse

A data lakehouse is a modern data architecture that merges the flexible, low-cost storage of a data lake with the structured data management and ACID transaction capabilities of a traditional data warehouse. Apache Iceberg is a core enabling technology for this pattern, providing the table format that brings reliability and performance to raw object storage.

Key Integration: Iceberg tables sit atop cloud object stores (like S3) to form the storage layer of a lakehouse.
Core Benefit: Enables both high-volume data science workloads and concurrent business intelligence queries on the same data.

Delta Lake

Delta Lake is an open-source storage layer, originally developed by Databricks, that provides similar foundational capabilities to Apache Iceberg. Both are table formats designed to bring reliability to data lakes.

Primary Comparison: Like Iceberg, Delta Lake offers ACID transactions, schema evolution, and time travel.
Architectural Difference: While functionally similar, they differ in implementation details, metadata structure, and ecosystem integration. Iceberg is noted for its open, specification-based approach and vendor-neutral community.

Apache Parquet

Apache Parquet is an open-source, columnar storage file format optimized for analytical query performance. It is the most common underlying data file format used within Apache Iceberg tables.

Role in Iceberg: Iceberg manages metadata (what data exists, its schema, partitions), while Parquet files store the actual data rows in a highly compressed, efficient format.
Performance Synergy: Iceberg's hidden partitioning and manifest planning work in concert with Parquet's columnar pruning and predicate pushdown to accelerate queries.

Metadata Catalog

A metadata catalog is a centralized registry that stores and manages metadata—such as schema, location, lineage, and access policies—for data assets. In an Iceberg architecture, the catalog tracks the current metadata pointer for each table.

Iceberg's Requirement: Iceberg requires a catalog (e.g., AWS Glue, Hive Metastore, Nessie, JDBC) to know which metadata file represents the current, valid state of a table.
Separation of Concerns: The catalog stores the pointer, while Iceberg's metadata files (in object storage) contain the detailed table schema, partition specs, and manifest lists.

ACID Compliance

ACID compliance is a set of properties—Atomicity, Consistency, Isolation, and Durability—that guarantee database transactions are processed reliably. Apache Iceberg implements these properties for data lake tables, a critical advancement over raw files.

Atomicity: Table updates (INSERT, DELETE, MERGE) are committed entirely or not at all.
Isolation: Readers see a consistent snapshot of the table, unaffected by concurrent writes.
Durability: Once a commit succeeds, it is persisted and cannot be lost.

Data Mesh

Data mesh is a decentralized data architecture that organizes data ownership around business domains, treating data as a product. Apache Iceberg is a key enabling technology for implementing the data product concept.

Product Interface: Iceberg provides a stable, well-defined table interface (schema, partitions, snapshots) that domain teams can own and manage.
Interoperability: Its open format allows different domains and consumption tools (Spark, Trino, Flink) to interoperate without proprietary lock-in, supporting the federated governance model of a data mesh.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.