AI Integration for OpenShift Data Foundation

Where AI Fits into OpenShift Data Foundation Operations
Integrating AI with OpenShift Data Foundation (ODF) transforms reactive storage administration into a predictive, self-optimizing data plane for AI/ML and stateful application workloads.
AI integration connects directly to ODF's core operational surfaces: the OpenShift Console plugin for administrator dashboards, the ODF MultiCloud Object Gateway (MCG) APIs for S3-compatible object operations, and the underlying Ceph cluster metrics exposed via Prometheus. The primary targets are the StorageSystem CRD for cluster configuration, CephBlockPool and CephFileSystem objects for performance data, and the NooBaa system for object storage analytics. AI agents consume these real-time telemetry streams and configuration states to build a continuous understanding of your storage environment.
Practical integration workflows focus on three high-impact areas: predictive capacity planning, where AI analyzes historical PersistentVolumeClaim growth rates and CephPool utilization to forecast shortages weeks in advance, suggesting pool expansion or data tiering; performance bottleneck identification, correlating application pod latency with specific OSD (Object Storage Daemon) metrics, disk latency, or network saturation to pinpoint root cause; and automated tiering policy recommendations, where AI evaluates data access patterns across CephBlockPool replication and CephFileSystem metadata performance to suggest optimal StorageClass configurations and data placement rules. This moves operations from manual log scrutiny to automated insight generation.
A production rollout typically involves a dedicated AI inference service deployed as a workload on the same OpenShift cluster, with secure, read-only access to ODF's Prometheus metrics and Kubernetes API. Governance is critical: AI recommendations should feed into approval workflows (e.g., via OpenShift GitOps) before any automated reconfiguration, and all suggestions must be logged to the cluster's audit trail. Start by integrating AI for read-only analysis and alerting—such as generating daily capacity reports or anomaly alerts—before progressing to supervised automation for non-disruptive tasks like adjusting CephFS MDS cache sizes or generating NooBaa bucket lifecycle policies. This phased approach builds trust in the AI's decision-making while delivering immediate operational visibility.
For teams managing large-scale AI/ML pipelines on OpenShift, this integration is essential. It ensures the data foundation is as dynamic and intelligent as the workloads it supports, preventing storage constraints from becoming the bottleneck for model training and inference jobs. By leveraging ODF's open APIs and metrics, Inference Systems delivers a tailored integration that augments your platform team's expertise, turning petabytes of storage into a managed, predictable asset. Explore related patterns for workload optimization in our guides on AI Integration for OpenShift AI and AI Integration with OpenShift GitOps.
ODF Touchpoints for AI Integration
Predictive Insights from ODF Metrics
Integrate AI agents with OpenShift Data Foundation's Prometheus metrics endpoint to analyze historical storage consumption and I/O patterns. By processing time-series data for Ceph pools, RBD images, and CephFS volumes, AI models can forecast capacity exhaustion weeks in advance and identify subtle performance bottlenecks—like latency spikes correlated with specific workload schedules or backend OSD imbalances.
Key integration points include the ODF NooBaa and Ceph dashboards exposed via the OpenShift Console, where AI can extract metrics for object_bucket_claims, persistent_volume_claims, and storage_class utilization. This enables proactive alerts and automated generation of capacity planning reports for platform teams, shifting operations from reactive firefighting to predictive management.
High-Value AI Use Cases for ODF
OpenShift Data Foundation (ODF) provides persistent storage for stateful AI workloads on OpenShift. Integrating AI directly with ODF's management APIs and metrics enables predictive operations, automated tiering, and intelligent capacity planning for platform and storage teams.
Predictive Capacity Planning & Alerting
Analyze ODF's PersistentVolumeClaim usage trends, Ceph pool utilization, and cluster growth metrics to forecast capacity exhaustion. AI agents can trigger preemptive scale-out workflows via the ODF API or OpenShift Machine API, moving alerts from reactive to predictive.
Automated Storage Tiering Recommendations
Evaluate workload I/O patterns (read/write latency, throughput) from ODF metrics to recommend optimal StorageClass assignments (e.g., performance vs. cost-optimized). AI can generate and apply StorageClass change policies during non-peak hours via Kubernetes batch jobs.
Performance Bottleneck Identification
Correlate application performance degradation with ODF backend metrics (Ceph OSD latency, network throughput). AI agents analyze logs and metrics to pinpoint if slowness originates from storage, network, or node resources, generating targeted troubleshooting runbooks for SREs.
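The correlation step described above can be sketched in a few lines. This is a minimal illustration, not a production analyzer: it assumes the application latency and each backend metric have already been sampled on the same timestamps, and the series names and values below are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_suspects(app_latency, backend_metrics):
    """Rank backend metric series by absolute correlation with app latency."""
    scores = {
        name: pearson(app_latency, series)
        for name, series in backend_metrics.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Illustrative, made-up samples aligned on the same timestamps
app_latency_ms = [12, 13, 12, 30, 45, 44, 14, 12]
backend = {
    "osd.3 commit latency (ms)": [4, 5, 4, 18, 27, 26, 5, 4],
    "node network throughput (MB/s)": [310, 300, 305, 298, 312, 301, 307, 299],
}
for name, score in rank_suspects(app_latency_ms, backend):
    print(f"{name}: r = {score:+.2f}")
```

A high correlation only narrows the search; it does not prove that the top-ranked metric is the cause, so the ranking is best treated as input to a runbook rather than a verdict.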
Anomalous Access Pattern Detection
Monitor PersistentVolume access patterns to detect potential ransomware activity or misconfigured batch jobs. AI models baseline normal I/O behavior and trigger security workflows (snapshot, quarantine) via integration with OpenShift Security Operator or external SIEM platforms.
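One simple way to baseline normal I/O behavior is a trailing-window z-score. The sketch below assumes write IOPS for a volume have already been exported as a list of evenly spaced samples; the window size, threshold, and sample values are illustrative assumptions, and a real deployment would likely need seasonality-aware baselines.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=12, threshold=3.0):
    """Return indices whose value deviates from the trailing-window baseline
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is a deviation
            if samples[i] != mu:
                anomalies.append(i)
        elif abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up write IOPS samples with one burst at index 13
write_iops = [100, 104, 98, 101, 99, 103, 97, 102, 100, 101, 99, 102, 100, 950, 101]
print(find_anomalies(write_iops))
```

Flagged indices would then be handed to whatever security workflow is in place; the detector itself takes no action.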
Cost-Optimized Snapshot & Backup Scheduling
Intelligently schedule ODF volume snapshots and backups based on application change rate and RPO requirements. AI analyzes write patterns to minimize snapshot frequency during high-churn periods and automate lifecycle policies for backup storage tiers (e.g., moving to object storage).
AI Workload Storage Provisioning
Automate provisioning of high-performance storage for GPU-intensive training jobs. AI agents intercept PipelineRun or Job creation in OpenShift AI, analyze requested GPU/CPU resources, and dynamically provision ODF volumes with appropriate StorageClass and performance characteristics.
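As a rough sketch of the provisioning step, the function below builds a PersistentVolumeClaim manifest from a job's resource request. The sizing rule (200 GiB per GPU) and the function and parameter names are hypothetical, and the StorageClass names are the usual ODF defaults, which may differ in your cluster.

```python
import json


def pvc_for_training_job(name, namespace, gpus, workers):
    """Build a PVC manifest sized and classed for a training job (heuristic)."""
    if gpus < 1 or workers < 1:
        raise ValueError("gpus and workers must be positive")
    shared = workers > 1
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Multi-worker jobs share one volume, so they need ReadWriteMany
            "accessModes": ["ReadWriteMany" if shared else "ReadWriteOnce"],
            # Assumed ODF default class names; adjust to your environment
            "storageClassName": (
                "ocs-storagecluster-cephfs" if shared
                else "ocs-storagecluster-ceph-rbd"
            ),
            # Hypothetical sizing rule: 200 GiB of scratch space per GPU
            "resources": {"requests": {"storage": f"{200 * gpus}Gi"}},
        },
    }


pvc = pvc_for_training_job("train-scratch", "ml-team", gpus=4, workers=2)
print(json.dumps(pvc, indent=2))
```

The manifest is printed as JSON, which Kubernetes accepts directly; submitting it for review instead of applying it keeps the agent within read-only permissions.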
Example AI-Driven Storage Workflows
These workflows illustrate how AI agents and copilots can integrate with OpenShift Data Foundation's APIs and metrics to automate storage operations, predict issues, and optimize resource allocation for platform engineering and SRE teams.
Trigger: Scheduled cron job or Prometheus alert rule fires when ODF cluster capacity exceeds 70%.
Context/Data Pulled:
- ODF API: Current StorageCluster status, CephCluster health, and pool utilization metrics.
- Prometheus: Historical usage trends for ceph_cluster_total_used_bytes and ceph_pool_bytes_used over the last 90 days.
- OpenShift API: Project/namespace growth rates and associated PersistentVolumeClaim (PVC) creation patterns.
Model/Agent Action:
- An AI agent analyzes the historical growth rate using time-series forecasting.
- It correlates growth with active projects and upcoming deployment schedules (pulled from OpenShift DeploymentConfigs or GitOps tooling).
- The model predicts the date the cluster will reach 85% and 95% capacity under current trends.
System Update/Next Step:
- The agent generates a detailed report and posts it to a designated Slack/Teams channel.
- It creates a Jira Service Management ticket with a pre-filled recommendation: "Add 3 OSD nodes of type standard_8 by [predicted date] to maintain 20% headroom."
- If integrated with Spectro Cloud or infrastructure provisioning, it can draft a Terraform/Ansible change request for the new nodes.
Human Review Point: The capacity expansion recommendation and generated ticket require platform team approval before any automated provisioning is executed.
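The threshold-date prediction in this workflow can be approximated with a least-squares trend line. The sketch below assumes one utilization sample per day, expressed as a percentage of raw capacity; the sample data is synthetic, and a linear trend is a simplifying assumption that will miss step changes such as a large new tenant.

```python
from datetime import date, timedelta


def fit_line(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least 2 samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(threshold_pct, daily_used_pct):
    """Days from the last sample until the trend reaches threshold_pct.

    Returns None when usage is flat or shrinking."""
    slope, intercept = fit_line(daily_used_pct)
    if slope <= 0:
        return None
    last_day = len(daily_used_pct) - 1
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - last_day)


# Synthetic 90-day history: 50% used, growing 0.25 points per day
usage = [50.0 + 0.25 * day for day in range(90)]
for threshold in (85, 95):
    remaining = days_until(threshold, usage)
    eta = date.today() + timedelta(days=round(remaining))
    print(f"{threshold}% expected in about {remaining:.0f} days ({eta})")
```

The resulting dates are what would populate the "[predicted date]" field in the ticket recommendation.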
Implementation Architecture: Data Flow and Guardrails
A practical blueprint for integrating AI agents with OpenShift Data Foundation (ODF) to automate capacity planning, performance analysis, and tiering policy management.
The integration connects AI agents to ODF's core data surfaces via its Prometheus metrics endpoint, the OpenShift Data Foundation Dashboard API, and the OpenShift API for cluster and namespace metadata. Agents continuously ingest time-series data on pool capacity, object bucket usage, IOPS/latency per StorageClass, and Ceph health status. This raw telemetry is enriched with contextual data from OpenShift—such as project labels, pod resource requests, and workload types—to build a holistic view of storage consumption patterns and performance demands.
For predictive workflows, the AI analyzes historical trends to forecast capacity exhaustion dates for each StoragePool and CephBlockPool, flagging pools projected to hit critical thresholds within the next 30 days. It correlates performance metrics (e.g., high latency on ocs-storagecluster-ceph-rbd) with specific workloads and node conditions, suggesting optimizations like adjusting CephBlockPool replication settings or migrating volumes between performance tiers. The system can generate and, upon approval, apply StorageClass or CephFilesystemSubVolumeGroup configurations to implement recommended tiering policies, moving less-active data to cost-efficient object storage.
All AI-driven recommendations and actions are governed by a multi-step approval workflow integrated with OpenShift's RBAC and GitOps pipelines. Proposed policy changes are output as structured YAML manifests (e.g., a new StorageCluster configuration or CephBlockPool spec) and submitted as Pull Requests to a Git repository monitored by Argo CD. Platform engineers review the changes in context, with the AI providing a clear rationale citing the underlying metrics. Any automated corrective action, such as triggering a NooBaa bucket cleanup job, is logged as an event in ODF and creates an audit trail in the cluster's OpenShift Audit Logs, ensuring full traceability for compliance.
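To make the "structured manifest plus rationale" idea concrete, here is a minimal sketch that builds a CephBlockPool manifest for a pull request. The annotation key, function name, and example values are hypothetical; the pool and namespace names are the usual ODF defaults. Note that in ODF the default pools are managed by the operator, so a direct edit may be reconciled away, which is one more reason to route such proposals through review.

```python
import json


def propose_pool_change(pool_name, namespace, new_size, rationale):
    """Build a CephBlockPool manifest intended for human review in a PR."""
    if new_size < 1:
        raise ValueError("replica size must be at least 1")
    return {
        "apiVersion": "ceph.rook.io/v1",
        "kind": "CephBlockPool",
        "metadata": {
            "name": pool_name,
            "namespace": namespace,
            # Hypothetical annotation key carrying the agent's reasoning
            "annotations": {"example.com/ai-rationale": rationale},
        },
        "spec": {"replicated": {"size": new_size}},
    }


manifest = propose_pool_change(
    pool_name="ocs-storagecluster-cephblockpool",
    namespace="openshift-storage",
    new_size=3,
    rationale="p99 write latency above target for 7 days; see attached metrics",
)
# JSON is valid YAML, so this file can be committed as-is for Argo CD to sync
print(json.dumps(manifest, indent=2))
```

Keeping the rationale inside the manifest means reviewers see the cited evidence in the same diff as the change.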
Code and Payload Examples
Analyzing Storage Trends with Python
Use the OpenShift Data Foundation (ODF) metrics API to retrieve historical usage data, then apply a simple forecasting model to predict future capacity needs. This example uses the prometheus-api-client to query ODF's integrated Prometheus instance for ceph_cluster_total_used_bytes.
```python
from datetime import timedelta

import pandas as pd
from prometheus_api_client import PrometheusConnect
from prometheus_api_client.utils import parse_datetime
from sklearn.linear_model import LinearRegression

# Connect to the ODF Prometheus endpoint (example URL). disable_ssl=True skips
# TLS verification and is only appropriate for a lab environment.
prom = PrometheusConnect(
    url="https://prometheus-odf-openshift-storage.apps.example.com",
    disable_ssl=True,
)

# Query used bytes over the last 30 days. The client expects datetime and
# timedelta objects here, not plain strings.
metric_data = prom.get_metric_range_data(
    metric_name="ceph_cluster_total_used_bytes",
    start_time=parse_datetime("30d"),
    end_time=parse_datetime("now"),
    chunk_size=timedelta(days=1),
)

# Each chunk comes back as a separate entry, so flatten them all
# (this assumes the query matches a single time series).
samples = [point for chunk in metric_data for point in chunk["values"]]
series = pd.Series(
    [float(value) / (1024**4) for _, value in samples],  # bytes -> TiB
    index=pd.to_datetime([ts for ts, _ in samples], unit="s"),
)

# Average to one value per day so the model's x-axis really is days
daily = series.resample("1D").mean().dropna()

# Create a simple linear forecast
df = pd.DataFrame({"day": range(len(daily)), "used_tib": daily.values})
model = LinearRegression()
model.fit(df[["day"]], df["used_tib"])

# Predict for the next 7 days
future_days = pd.DataFrame({"day": range(len(daily), len(daily) + 7)})
predicted_usage = model.predict(future_days)

# Trigger alert if predicted to exceed 80% of total capacity in 7 days
total_capacity_tib = 100  # Example: Get from 'ceph_cluster_total_bytes'
if any(pred > total_capacity_tib * 0.8 for pred in predicted_usage):
    print("ALERT: Projected to exceed 80% capacity within 7 days.")
```
This script helps platform teams proactively add storage before users experience issues, automating a key FinOps and capacity planning task.
Realistic Time Savings and Operational Impact
This table illustrates the operational impact of integrating AI with OpenShift Data Foundation (ODF) for predictive analytics and automated policy management, moving from reactive to proactive storage administration.
| Storage Operation | Before AI Integration | After AI Integration | Implementation Notes |
|---|---|---|---|
| Capacity forecasting and planning | Manual analysis of usage trends, quarterly reviews | Automated 30/60/90-day forecasts with confidence intervals | AI analyzes ODF metrics and Prometheus data; human review for major procurement |
| Performance bottleneck identification | Reactive troubleshooting after user reports slowness | Proactive alerts on I/O patterns and latency spikes | Correlates Ceph metrics with node/network telemetry; suggests targeted investigations |
| Storage tiering policy optimization | Static policies based on initial workload assumptions | Dynamic policy recommendations based on access frequency | AI reviews object bucket and PVC access logs; policies applied via ODF console or GitOps |
| Volume failure prediction | Relies on hardware SMART alerts or post-failure RCA | Predictive alerts on disk/OSD health degradation trends | Models trained on historical failure data; reduces unplanned downtime but does not eliminate risk |
| Garbage collection and rebalancing scheduling | Fixed schedules or manual triggers during maintenance windows | Intelligent scheduling based on cluster load and performance impact | Minimizes performance hit during peak business hours; integrates with ODF maintenance APIs |
| Anomaly detection in usage/performance | Manual dashboard monitoring or threshold-based alerts | Automated baseline establishment and deviation detection | Reduces alert fatigue by filtering noise; surfaces genuine outliers for engineer review |
| Audit report generation for compliance | Manual compilation of logs and configuration snapshots | Automated report drafting with highlighted exceptions | AI pulls from ODF audit logs, Kubernetes events, and policy states; human finalizes and submits |
Governance, Security, and Phased Rollout
Integrating AI with OpenShift Data Foundation (ODF) requires a security-first, policy-driven approach to ensure reliability and control.
AI integration with ODF must be scoped to specific, high-value surfaces within the storage stack. Key integration points include the OpenShift Data Foundation Dashboard API for capacity and performance metrics, Prometheus endpoints for time-series data on volume latency and throughput, and the OCP/ODF Operator's configuration layer for policy recommendations. AI agents should be designed to generate actionable insights, such as predicting when a PersistentVolumeClaim (PVC) will exhaust its requested capacity, without direct write access to production configurations, maintaining a clear separation of duties.
A production implementation typically involves a dedicated service account with read-only access to ODF metrics and a secure, out-of-band workflow engine. For example, an AI agent analyzing Ceph pool performance might detect a bottleneck and generate a Jira ticket or ServiceNow incident with a recommended adjustment to the CephBlockPool replication size (spec.replicated.size). The actual change is then executed by a platform engineer or through a pre-approved GitOps pipeline, creating a full audit trail. This pattern ensures AI augments human decision-making within existing RBAC and change management procedures.
A phased rollout is critical. Start with a non-critical development cluster, focusing on predictive capacity alerts for block and file storage. Use this phase to tune AI models on your specific data patterns and validate alert accuracy. Phase two can introduce performance bottleneck identification for production workloads, correlating ODF metrics with application performance data from OpenShift Monitoring. The final phase involves closed-loop recommendations for automated tiering policies, where AI suggests moving cold data to a cost-effective storage class, but execution requires manual approval. This gradual approach builds trust, refines governance, and isolates risk.
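The phase gating described above can be expressed as a small policy check that every agent output passes through. This is a sketch under assumed names: the phase labels, proposal kinds, and decision strings are all hypothetical and would map onto whatever workflow engine is in use.

```python
from dataclasses import dataclass
from enum import IntEnum


class Phase(IntEnum):
    CAPACITY_ALERTS = 1          # dev cluster, read-only alerts
    PERFORMANCE_ANALYSIS = 2     # production bottleneck identification
    TIERING_RECOMMENDATIONS = 3  # closed-loop suggestions, manual approval


@dataclass(frozen=True)
class Proposal:
    kind: str              # e.g. "capacity_alert", "tiering_change"
    mutates_cluster: bool  # True if acting on it changes configuration


# Earliest rollout phase in which each proposal kind is allowed
REQUIRED_PHASE = {
    "capacity_alert": Phase.CAPACITY_ALERTS,
    "bottleneck_report": Phase.PERFORMANCE_ANALYSIS,
    "tiering_change": Phase.TIERING_RECOMMENDATIONS,
}


def decide(proposal, current_phase):
    """Return 'reject', 'emit', or 'needs_approval' for a proposal."""
    required = REQUIRED_PHASE.get(proposal.kind)
    if required is None or current_phase < required:
        return "reject"  # unknown kind, or the rollout has not reached it yet
    # Anything that would change the cluster always goes to a human
    return "needs_approval" if proposal.mutates_cluster else "emit"


print(decide(Proposal("capacity_alert", False), Phase.CAPACITY_ALERTS))
print(decide(Proposal("tiering_change", True), Phase.PERFORMANCE_ANALYSIS))
print(decide(Proposal("tiering_change", True), Phase.TIERING_RECOMMENDATIONS))
```

Rejecting unknown proposal kinds by default keeps new agent capabilities from reaching production before they have been assigned to a phase.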
Security is paramount. All AI interactions with ODF APIs must use short-lived service account tokens, and any vector data used for pattern analysis (like historical usage trends) should be anonymized and stored in a dedicated, encrypted vector database. Implement strict network policies to limit traffic between your AI inference services and the ODF management plane. By treating AI as a privileged, yet tightly governed, observer within your storage operations, you gain intelligent foresight without compromising the stability or security of your core data infrastructure.
Frequently Asked Questions
Practical questions about embedding AI agents and predictive analytics into ODF workflows for storage operations, capacity planning, and performance management.
AI agents connect to ODF's Prometheus metrics endpoint and the OpenShift Console's ODF plugin APIs to analyze time-series data. A typical integration workflow includes:
- Trigger: A new alert is generated by ODF's built-in monitoring (e.g., CephHealthError).
- Context Pulled: The AI agent retrieves related metrics for the last 24 hours: pool utilization, OSD performance, network latency, and recent configuration changes from the ODF operator's status.
- Agent Action: The LLM analyzes the correlated data to suggest a root cause, for example: "High client_ops on pool-ssd correlates with a recent PVC expansion; check for a single tenant's workload."
- System Update: The agent creates a summarized incident ticket in the connected ITSM platform (e.g., ServiceNow) with the analysis and suggested CLI commands for investigation.
- Human Review: The storage administrator reviews the ticket and can approve the agent to execute a safe remediation, like rebalancing a Ceph pool, via a secure, audited tool-calling API.
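The trigger and ticket-drafting steps of this workflow might look like the sketch below. It assumes alerts arrive in the Alertmanager webhook format (an `alerts` list whose entries carry `status` and `labels`); the ticket fields, context keys, and sample values are hypothetical, and the LLM analysis step is left out.

```python
def build_ticket(webhook_payload, context):
    """Turn firing Ceph alerts from an Alertmanager webhook into a ticket draft.

    Returns None when no firing Ceph alert is present."""
    firing = [
        alert
        for alert in webhook_payload.get("alerts", [])
        if alert.get("status") == "firing"
        and alert.get("labels", {}).get("alertname", "").startswith("Ceph")
    ]
    if not firing:
        return None
    names = sorted({alert["labels"]["alertname"] for alert in firing})
    lines = [f"- {key}: {value}" for key, value in sorted(context.items())]
    return {
        "title": f"[ODF] {', '.join(names)}",
        "body": "Context from the last 24 hours:\n" + "\n".join(lines),
        # Remediation is never executed without an administrator's approval
        "requires_human_approval": True,
    }


# Made-up payload: one firing Ceph alert, one resolved, one unrelated
payload = {
    "alerts": [
        {"status": "firing", "labels": {"alertname": "CephHealthError"}},
        {"status": "resolved", "labels": {"alertname": "CephOSDNearFull"}},
        {"status": "firing", "labels": {"alertname": "KubePodCrashLooping"}},
    ]
}
ticket = build_ticket(payload, {"pool utilization": "78%", "osd latency p99": "41 ms"})
print(ticket["title"])
print(ticket["body"])
```

The draft would then be posted to the ITSM platform through its own API, which is outside the scope of this sketch.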

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.