Inferensys

Object Storage

Object storage is a data storage architecture that manages data as discrete units called objects, each containing the data itself, metadata, and a globally unique identifier, accessed via RESTful APIs.
MULTIMODAL DATA STORAGE

What is Object Storage?

Object storage is the foundational, scalable architecture for managing unstructured data in modern AI and data platforms.

Object storage is a data storage architecture that manages information as discrete units called objects, each containing the data itself, a variable amount of metadata, and a globally unique identifier, typically accessed via a RESTful API over HTTP/HTTPS. Unlike traditional file systems with hierarchical directories or block storage with fixed-size blocks, objects are stored in a flat address space within a massive, scalable namespace. This design is inherently optimized for storing vast amounts of unstructured data—such as images, videos, audio files, sensor logs, and model checkpoints—making it the de facto backbone for data lakes, multimodal AI training datasets, and cloud-native applications.

Key characteristics include near-limitless horizontal scalability, cost-effectiveness through commodity hardware, and resilience via data replication or erasure coding. Its rich, user-defined metadata enables advanced data management and is crucial for cross-modal retrieval systems. While not suited for high-frequency transactional updates, its simplicity and durability make it ideal for write-once, read-many workloads. Major implementations include Amazon S3, Google Cloud Storage, and Azure Blob Storage, which commonly serve as the storage layer beneath table formats such as Apache Iceberg and Delta Lake that add transactional capabilities on top.

OBJECT STORAGE

Core Architectural Features

Object storage is defined by a set of core architectural principles that differentiate it from traditional file and block storage, enabling its massive scale, durability, and API-driven access.

01

Flat Namespace & Unique Identifiers

Unlike hierarchical file systems with nested directories, object storage uses a flat address space. Each object is addressed by a unique identifier: in S3-style systems this is the object key within a bucket, while some platforms assign a generated ID such as a UUID or content hash. Because there is no directory tree to traverse, the namespace can be partitioned and scaled horizontally. A key can simulate a file path (e.g., photos/vacation/beach.jpg), but this is purely a naming convention for the client, not a physical directory structure.
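
As an illustration of the point about path-like keys, here is a minimal in-memory sketch. The class `FlatBucket` and its methods are hypothetical names invented for this example, not any vendor's API. It stores every object in one flat dictionary and derives folder-like views purely from key prefixes and a delimiter.

```python
class FlatBucket:
    """Toy in-memory bucket (hypothetical): one flat dict, no directories."""

    def __init__(self):
        self._objects = {}  # key (str) -> payload (bytes)

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix="", delimiter=None):
        """Return (keys, common_prefixes) for keys that start with `prefix`.

        With a delimiter, keys that contain it after the prefix are rolled up
        into one "common prefix", which is how a client can present a
        folder-like view over a flat key space.
        """
        keys, common = [], set()
        for key in sorted(self._objects):
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                keys.append(key)
        return keys, sorted(common)


bucket = FlatBucket()
bucket.put("photos/vacation/beach.jpg", b"placeholder")
bucket.put("photos/vacation/sunset.jpg", b"placeholder")
bucket.put("photos/profile.jpg", b"placeholder")
bucket.put("readme.txt", b"placeholder")

# "Top level" view: one plain key and one simulated folder.
top_keys, top_prefixes = bucket.list(delimiter="/")
# View "inside" photos/: one plain key and one simulated subfolder.
photo_keys, photo_prefixes = bucket.list(prefix="photos/", delimiter="/")
print(top_keys, top_prefixes)
print(photo_keys, photo_prefixes)
```

The "folders" exist only in the listing logic; the stored data is a single mapping from key to bytes.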

02

RESTful API (HTTP/HTTPS) Access

Primary interaction with object storage is via RESTful APIs over HTTP/HTTPS, using standard verbs:

  • PUT to upload an object
  • GET to retrieve an object
  • DELETE to remove an object
  • HEAD to fetch metadata

This API-centric model makes it inherently cloud-native and accessible from any application or tool that speaks HTTP, without requiring specialized filesystem drivers. It is the foundation for services like Amazon S3, which defined the de facto standard S3 API.
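
To make the verb mapping concrete, the sketch below builds (but does not send) one HTTP request per operation using only the Python standard library. The endpoint `https://storage.example.com`, the bucket name, and the path-style URL layout are assumptions for illustration. The `x-amz-meta-` header prefix follows the S3 convention for user metadata. Real services also require authentication (for example, signed requests), which is omitted here.

```python
from urllib.parse import quote
from urllib.request import Request

ENDPOINT = "https://storage.example.com"  # hypothetical endpoint


def object_url(bucket, key):
    """Path-style URL for an object; quote() leaves '/' in the key intact."""
    return f"{ENDPOINT}/{bucket}/{quote(key)}"


def build_requests(bucket, key, body, content_type):
    """Return one unsent Request per verb, keyed by verb name."""
    url = object_url(bucket, key)
    return {
        "PUT": Request(
            url,
            data=body,
            method="PUT",
            headers={
                "Content-Type": content_type,
                "x-amz-meta-modality": "image",  # user metadata, S3-style header
            },
        ),
        "GET": Request(url, method="GET"),
        "HEAD": Request(url, method="HEAD"),
        "DELETE": Request(url, method="DELETE"),
    }


requests_by_verb = build_requests(
    "training-data",               # hypothetical bucket
    "photos/vacation/beach.jpg",
    b"\xff\xd8\xff",               # placeholder bytes, not a real image
    "image/jpeg",
)
for verb, req in requests_by_verb.items():
    print(verb, req.full_url)
```

Passing any of these to `urllib.request.urlopen` would issue the call, which is why no special filesystem driver is needed on the client.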

03

Rich, Customizable Metadata

Each object carries system metadata (size, creation date, content-type) and extensive user-defined metadata. This metadata is stored as key-value pairs directly with the object, so descriptive context travels with the data. For multimodal data, this is critical for storing context like:

  • modality: video_audio
  • source_sensor: lidar_v1
  • embedding_model: clip-vit-base-patch32
  • data_license: CC-BY-4.0

Applications can read these attributes with a HEAD or GET request. Searching across many objects by metadata is a different matter: S3-style APIs list objects by key prefix rather than by user metadata, so large-scale filtering generally relies on an external index or metadata catalog.
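
A small sketch of how attribute-based lookup might be layered on top of per-object metadata. It assumes the store itself offers no server-side filtering on user metadata, which is why the client builds a small index. `StoredObject` and `build_metadata_index` are hypothetical names, and the `point_cloud` modality value is invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    key: str
    data: bytes
    content_type: str
    user_metadata: dict = field(default_factory=dict)

    def head(self):
        """Roughly what a HEAD-style call returns: metadata, no payload."""
        return {
            "content-length": len(self.data),
            "content-type": self.content_type,
            **self.user_metadata,
        }


def build_metadata_index(objects):
    """Map each (name, value) metadata pair to the set of matching keys."""
    index = defaultdict(set)
    for obj in objects:
        for name, value in obj.user_metadata.items():
            index[(name, value)].add(obj.key)
    return index


objects = [
    StoredObject(
        "clips/0001.mp4", b"\x00" * 1024, "video/mp4",
        {"modality": "video_audio", "data_license": "CC-BY-4.0"},
    ),
    StoredObject(
        "scans/0001.bin", b"\x00" * 2048, "application/octet-stream",
        {"modality": "point_cloud", "source_sensor": "lidar_v1",
         "data_license": "CC-BY-4.0"},
    ),
]

index = build_metadata_index(objects)
cc_by_keys = sorted(index[("data_license", "CC-BY-4.0")])
lidar_keys = sorted(index[("source_sensor", "lidar_v1")])
print(cc_by_keys, lidar_keys)
```

In production this index would typically live in a catalog service or database and be refreshed from storage events or inventory listings.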

04

Eventual Consistency & Strong Consistency Models

Distributed object stores historically defaulted to eventual consistency for performance at scale: a GET issued immediately after a PUT could return stale data or miss the new object. Major services now provide strong read-after-write consistency within a region (Amazon S3 has done so since December 2020), but cross-region replication remains asynchronous, so a replica in another region may lag behind. This is a crucial architectural trade-off. Multimodal pipelines requiring strict data versioning for aligned text-video pairs should read from a strongly consistent endpoint or add application-level checks such as version IDs or checksums.
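
The difference between the two read models can be sketched with a toy primary/replica pair. This is a simplified simulation, not how any particular service is implemented. `ReplicatedStore` is a hypothetical class, and replication is triggered manually so the stale window is visible.

```python
from collections import deque


class ReplicatedStore:
    """Toy model: writes land on a primary and reach a replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = deque()  # writes not yet applied to the replica

    def put(self, key, data):
        self.primary[key] = data
        self._pending.append((key, data))

    def get(self, key, strong=True):
        """Strong reads hit the primary; eventual reads hit the replica."""
        source = self.primary if strong else self.replica
        return source.get(key)

    def replicate(self):
        """Apply all queued writes to the replica, in order."""
        while self._pending:
            key, data = self._pending.popleft()
            self.replica[key] = data


store = ReplicatedStore()
store.put("pairs/0001.json", b"v1")
store.replicate()
store.put("pairs/0001.json", b"v2")  # replica has not seen this yet

stale = store.get("pairs/0001.json", strong=False)
fresh = store.get("pairs/0001.json", strong=True)
store.replicate()
converged = store.get("pairs/0001.json", strong=False)
print(stale, fresh, converged)
```

A pipeline that pairs a video with its transcript would want the strong read, or would compare a version ID or checksum before trusting the replica.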

05

Immutability & Versioning

Objects are fundamentally immutable; they cannot be partially updated. Changing an object requires rewriting it entirely. This aligns perfectly with write-once, read-many (WORM) data patterns common in AI/ML (training datasets, model checkpoints). Object versioning is a key feature that preserves every version of an object when enabled, providing a built-in audit trail and protection against accidental deletion or corruption. This is essential for data lineage and reproducible machine learning.
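
A minimal sketch of the versioning behaviour described above, using a hypothetical `VersionedBucket` class. Every write appends a new version, and a delete appends a marker rather than erasing history, similar in spirit to delete markers in S3 versioning. The sequential version IDs are a simplification.

```python
import itertools


class VersionedBucket:
    """Toy versioned store: whole-object writes, history never overwritten."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, data or None)
        self._counter = itertools.count(1)

    def _next_id(self):
        return f"v{next(self._counter)}"

    def put(self, key, data):
        """Store a complete new version; there is no partial update."""
        version_id = self._next_id()
        self._versions.setdefault(key, []).append((version_id, bytes(data)))
        return version_id

    def delete(self, key):
        """Append a delete marker (data=None) instead of removing versions."""
        version_id = self._next_id()
        self._versions.setdefault(key, []).append((version_id, None))
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            # Latest version; a trailing delete marker hides the object.
            if not versions or versions[-1][1] is None:
                raise KeyError(key)
            return versions[-1][1]
        for vid, data in versions:
            if vid == version_id and data is not None:
                return data
        raise KeyError((key, version_id))


bucket = VersionedBucket()
key = "checkpoints/model.safetensors"
v1 = bucket.put(key, b"weights-epoch-1")
v2 = bucket.put(key, b"weights-epoch-2")
latest = bucket.get(key)

bucket.delete(key)                          # accidental deletion
recovered = bucket.get(key, version_id=v1)  # older version still readable
print(latest, recovered)
```

Pinning a training run to explicit version IDs, rather than to "latest", is one way to keep experiments reproducible.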

06

Durability via Erasure Coding

Object storage is designed for extreme durability (e.g., 99.999999999%, or 11 nines), typically through replication, erasure coding, or a combination of both. With erasure coding, data is broken into fragments, mathematically encoded with redundant parity fragments, and distributed across multiple failure domains (disks, racks, data centers). The original object can be reconstructed from a subset of the fragments. This provides comparable durability with significantly lower storage overhead than full replication, making it cost-effective for petabytes of multimodal training data.
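
The idea of rebuilding an object from a subset of fragments can be sketched with the simplest possible erasure code: k data fragments plus one XOR parity fragment, which tolerates the loss of any single fragment. This is a deliberately simplified stand-in. Production systems typically use Reed-Solomon-style codes with several parity fragments. All function names here are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def split_into_fragments(data, k):
    """Split data into k equal-size fragments, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\x00")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def encode(data, k):
    """Return k data fragments followed by one XOR parity fragment."""
    fragments = split_into_fragments(data, k)
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def reconstruct(fragments, original_length):
    """Rebuild the payload when at most one fragment is missing (None)."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    fragments = list(fragments)
    if missing:
        survivors = [frag for frag in fragments if frag is not None]
        # XOR of all surviving fragments equals the missing one.
        fragments[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(fragments[:-1])[:original_length]


payload = b"multimodal training shard"
stored = encode(payload, k=4)
stored[2] = None  # simulate losing one fragment (e.g., a failed disk)
restored = reconstruct(stored, len(payload))

# Extra storage relative to the raw data:
parity_overhead = 1 / 4             # 4 data + 1 parity
replication_overhead = (3 - 1) / 1  # three full copies
print(restored, parity_overhead, replication_overhead)
```

In this toy configuration the parity scheme costs 25% extra storage, while three-way replication costs 200% extra, which is the overhead gap the paragraph above refers to.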

ARCHITECTURE

How Object Storage Works: The Object Model

Object storage is the foundational architecture for modern data lakes and multimodal AI systems, managing unstructured data as discrete, self-contained objects rather than files in a hierarchy.

Object storage is a data storage architecture that manages data as discrete units called objects, each containing the data itself, a variable amount of metadata, and a globally unique identifier, typically accessed via a RESTful API. Unlike traditional file systems with hierarchical directories, objects are stored in a flat address space within a massively scalable bucket, enabling near-infinite horizontal scaling for unstructured data like images, videos, and model checkpoints.

The object model decouples data from application servers, allowing direct access over HTTP/S. Each object's metadata is extensible, enabling rich tagging for AI workflows, while the unique identifier allows location-independent retrieval. This architecture, fundamental to data lakes and multimodal data storage, provides durability through erasure coding and cost-efficiency via automated tiered storage policies, making it ideal for petabyte-scale AI training datasets.

COMPARISON

Object Storage vs. Block vs. File Storage

A technical comparison of the three fundamental data storage architectures, highlighting their core mechanisms, access patterns, and suitability for different workloads within a multimodal data architecture.

Feature comparison: Object Storage, Block Storage, File Storage

Data Model & Unit

  • Object Storage: Discrete objects containing data, metadata, and a globally unique ID (e.g., Amazon S3 key).
  • Block Storage: Raw, fixed-size blocks of data (e.g., 4KB, 8KB blocks). Addressable by block index.
  • File Storage: Files organized in a hierarchical directory tree (folders and subfolders).

Primary Access Protocol

  • Object Storage: RESTful HTTP APIs (GET, PUT, DELETE).
  • Block Storage: Low-level block protocols (SCSI, iSCSI, Fibre Channel).
  • File Storage: File-level protocols (NFS, SMB/CIFS, Lustre).

Metadata Handling

  • Object Storage: Extensible, custom user-defined metadata stored with the object.
  • Block Storage: Limited to basic system metadata (e.g., block address, volume ID).
  • File Storage: Fixed, system-defined metadata (e.g., filename, size, permissions, timestamps).

Scalability Limit

  • Object Storage: Effectively limitless, scales horizontally across a flat namespace.
  • Block Storage: Scales vertically; limited by the size of the individual volume/LUN.
  • File Storage: Scales vertically; limited by the capacity of the individual file system.

Typical Performance Profile

  • Object Storage: High throughput for large, sequential reads/writes. Higher latency for individual operations.
  • Block Storage: Consistent, ultra-low latency for random read/write operations.
  • File Storage: Good latency for file operations within a localized directory structure.

Ideal Workload

  • Object Storage: Unstructured data, archives, backups, multimedia, large-scale analytics, web content.
  • Block Storage: Databases (RDBMS, NoSQL), virtual machine disks, high-performance transactional systems.
  • File Storage: Shared documents, home directories, source code repositories, traditional application data.

Modification Paradigm

  • Object Storage: Objects are immutable. Updates require rewriting the entire object.
  • Block Storage: Blocks are mutable. Specific blocks can be overwritten in-place.
  • File Storage: Files are mutable. Specific bytes within a file can be overwritten in-place.

Cost Structure

  • Object Storage: Lowest cost per gigabyte, often with tiering for infrequent access.
  • Block Storage: Highest cost per gigabyte, premium for performance and low latency.
  • File Storage: Moderate cost, varies with performance features (e.g., high IOPS file systems).

MULTIMODAL DATA STORAGE

Object Storage in AI & Machine Learning

Object storage is the foundational, scalable data lake architecture for managing the vast, unstructured datasets required for modern AI. It provides the durable, cost-effective substrate for multimodal data, model artifacts, and embeddings.

01

Core Architecture: Objects, Buckets, and APIs

Object storage manages data as discrete objects, not files in a hierarchy. Each object contains:

  • Data: The raw bytes (e.g., an image, video, or model checkpoint).
  • Metadata: Custom key-value pairs describing the object (e.g., source=camera_5, model_version=v2.1).
  • Globally Unique Identifier: An immutable address, like a UUID.

Objects are organized into flat namespaces called buckets (or containers). Access is primarily via RESTful HTTP APIs (e.g., S3, GCS, Swift), making it ideal for cloud-native, distributed applications. This contrasts with block storage (for databases) or file storage (for shared drives).

02

The Foundation for Data Lakes & Lakehouses

Object storage is the default backend for modern data lakes due to its:

  • Massive Scalability: Petabyte-scale capacity that grows linearly.
  • Cost Efficiency: Significantly lower cost per gigabyte than block or file storage.
  • Durability: Designed for 99.999999999% (11 nines) data durability via replication or erasure coding.

Formats like Apache Parquet and Apache ORC store tabular data efficiently on object stores. Table formats like Apache Iceberg, Delta Lake, and Apache Hudi layer on top, providing ACID transactions, schema evolution, and time travel, transforming a raw data lake into a managed data lakehouse.

03

Storing Multimodal AI Data

AI training pipelines consume heterogeneous, unstructured data perfectly suited for object storage:

  • Images & Video: Raw footage, labeled frames, and augmented copies.
  • Audio: Speech samples, environmental sounds, and spectrograms.
  • Text Corpora: Massive JSONL files, web crawls, and document scans.
  • Sensor Data: Time-series telemetry from IoT devices.
  • Model Artifacts: Multi-gigabyte checkpoint files (.ckpt, .safetensors), frozen graphs, and ONNX models.

Metadata tags (e.g., modality=video, label=cat) enable efficient data discovery and pipeline orchestration without complex directory trees.

04

Integration with Vector Databases & Feature Stores

Object storage works in concert with specialized AI data systems:

  • Vector Databases: Store the computed embeddings in the vector index, while the original source files (images, PDFs) remain in object storage. The vector DB references the object's URI.
  • Feature Stores: Store large, pre-computed offline feature datasets as Parquet files in object storage, serving them for model training. High-speed online stores handle low-latency inference.
  • Metadata Catalogs: Services like AWS Glue Data Catalog or OpenMetadata index the objects' metadata, enabling SQL-based discovery across petabytes of multimodal data.
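
The division of labour between a vector index and object storage can be sketched as follows. The index holds only embeddings and URIs, and the source files stay in the bucket. The URIs, the three-dimensional embeddings, and the brute-force `search` function are all illustrative assumptions. A real vector database would use an approximate nearest-neighbour index.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical index rows: a tiny embedding plus a pointer to the source object.
vector_index = [
    {"uri": "s3://example-bucket/images/cat_001.jpg", "embedding": [0.9, 0.1, 0.0]},
    {"uri": "s3://example-bucket/images/dog_001.jpg", "embedding": [0.1, 0.9, 0.0]},
    {"uri": "s3://example-bucket/docs/report.pdf", "embedding": [0.0, 0.1, 0.9]},
]


def search(query_embedding, k=1):
    """Return the URIs of the k most similar rows (brute-force scan)."""
    ranked = sorted(
        vector_index,
        key=lambda row: cosine_similarity(query_embedding, row["embedding"]),
        reverse=True,
    )
    return [row["uri"] for row in ranked[:k]]


top_uris = search([0.8, 0.2, 0.0], k=2)
print(top_uris)
```

The application would then fetch the matching objects from storage by URI, so large binaries never need to be copied into the vector database.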

05

Performance & Optimization for AI Workloads

While not designed for low-latency transactional workloads, object storage is optimized for AI's high-throughput patterns:

  • Sequential Reads: Training pipelines stream large datasets sequentially, maximizing bandwidth.
  • Parallel Reads from Decoupled Compute: Frameworks like Apache Spark and Ray read many objects concurrently over the network, so compute clusters scale independently of storage rather than running inside the storage cluster.
  • Intelligent Tiering: Automatically moves infrequently accessed data (old model versions, raw archives) to cheaper archive tiers (e.g., S3 Glacier).
  • Multi-Part Uploads: Enables parallel, resilient uploads of multi-gigabyte model files.
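
The multi-part pattern in the list above can be sketched without a network: split a payload into numbered parts, "upload" them concurrently, and assemble them in part-number order. `upload_part` is a stand-in that only computes a checksum. The function names and the 1,024-byte part size are assumptions for the demo, and real services impose their own part-size limits.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def split_parts(data, part_size):
    """Return a list of (part_number, chunk), numbering parts from 1."""
    offsets = range(0, len(data), part_size)
    return [(n + 1, data[i:i + part_size]) for n, i in enumerate(offsets)]


def upload_part(part):
    """Stand-in for a network upload: return part number, checksum, chunk."""
    number, chunk = part
    return number, hashlib.sha256(chunk).hexdigest(), chunk


def multipart_upload(data, part_size, workers=4):
    parts = split_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(upload_part, parts))
    # pool.map already preserves input order; sorting makes the
    # "assemble by part number" step explicit.
    results.sort(key=lambda result: result[0])
    assembled = b"".join(chunk for _, _, chunk in results)
    receipts = [(number, digest) for number, digest, _ in results]
    return assembled, receipts


payload = bytes(range(256)) * 40  # 10,240 bytes of sample data
assembled, receipts = multipart_upload(payload, part_size=1024)
print(len(receipts), assembled == payload)
```

Because each part is independent, a failed part can be retried on its own instead of restarting the whole multi-gigabyte transfer.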

Challenges include higher latency for small, random reads compared to block storage and, in some systems or across regions, eventual consistency.

06

Security, Governance, and Compliance

Enterprise AI requires robust data controls, which object storage provides:

  • Encryption: Encryption at rest (AES-256) and in transit (TLS 1.2+).
  • Access Policies: Fine-grained Identity and Access Management (IAM) policies control which users, roles, or services can read/write objects.
  • Immutability & Versioning: Object versioning protects against accidental deletion. Write-Once-Read-Many (WORM) or object lock policies create immutable backups for audit trails and ransomware protection.
  • Lineage & Logging: API calls can be logged to services like AWS CloudTrail, providing an audit trail that supports data lineage and compliance requirements (GDPR, HIPAA).
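
As one illustration of fine-grained access policies, the sketch below builds a read-only policy document in the AWS-style JSON policy grammar. The bucket name, prefix, account ID, and role name are placeholders, and `read_only_policy` is a hypothetical helper. Other providers express the same idea with their own policy formats.

```python
import json


def read_only_policy(bucket, prefix, principal_arn):
    """Allow one principal to read objects under a single key prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTrainingRead",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


policy = read_only_policy(
    "example-training-data",                                 # placeholder bucket
    "datasets/v1/",                                          # placeholder prefix
    "arn:aws:iam::123456789012:role/example-training-role",  # placeholder role
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Scoping the resource to a prefix lets a training job read one dataset version without gaining access to the rest of the bucket.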

OBJECT STORAGE

Frequently Asked Questions

Object storage is the foundational architecture for modern data lakes and multimodal AI systems. These questions address its core mechanisms, advantages, and role in enterprise AI infrastructure.

Object storage is a data storage architecture that manages data as discrete units called objects, each containing the data itself, a variable amount of metadata, and a globally unique identifier. Objects are typically accessed via a RESTful API such as the Amazon S3 API. Unlike file systems with hierarchical directories or block storage with fixed-size blocks, objects are stored in a flat address space. To retrieve data, a client application sends the object's unique identifier to the storage system via an API call (e.g., HTTP GET). The system locates the object using its identifier, returns the data and metadata, and handles underlying complexities like physical location, redundancy, and scaling transparently.

Key components of an object:

  • Data (Payload): The actual file content (e.g., image, video, log file).
  • Metadata: Customizable key-value pairs describing the object (e.g., author, created-date, modality=video).
  • Globally Unique Identifier: An immutable address (e.g., a UUID) that is not tied to physical location.

This architecture is inherently scalable, making it ideal for unstructured data like the text, audio, and video used in multimodal AI pipelines.

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.