AI integration for Splunk Cloud Observability focuses on the metrics, traces, and logs ingested into the platform, particularly from sources like APM tools (e.g., Dynatrace, New Relic), infrastructure monitors, and custom application telemetry. The goal is to correlate anomalies in performance KPIs—such as sudden CPU spikes, elevated error rates, or increased API latency—with concurrent security events from your SIEM data. This creates a unified investigative surface where a crypto-mining alert on a virtual machine can be instantly linked to the anomalous process and network telemetry that triggered it, or where a DDoS attack's impact on service availability is quantified in real-time business terms.
AI Integration for Splunk Cloud Observability

Where AI Fits in Splunk Cloud Observability
Integrate AI with Splunk Cloud Observability to move beyond siloed dashboards and connect security incidents to their root causes in infrastructure and application performance.
Implementation typically involves deploying AI models that analyze time-series data within Splunk's Metric Store and ITSI service models. These models establish behavioral baselines for normal performance and flag deviations. When a significant anomaly is detected, an automated workflow can:
1. Query the Splunk Enterprise Security correlation engine for related security alerts within the same time window and entity scope (host, user, IP).
2. Use a large language model to synthesize a narrative that explains the potential link, for example: 'The 400% increase in outbound network traffic from app-server-05 coincides with a malware detection alert and anomalous spawning of powershell.exe processes.'
3. Enrich the resulting incident in Splunk Mission Control or a connected ITSM tool like ServiceNow with this cross-domain context, priority, and suggested diagnostic SPL searches.
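The time-window and entity-scope lookup can be sketched in a few lines. This is a minimal illustration rather than a Splunk API: the `Anomaly` and `SecurityAlert` records, the `related_alerts` helper, and the 15-minute window are all assumptions, and in practice the alerts would come from a search against your security indexes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Anomaly:
    host: str
    metric: str
    detected_at: datetime


@dataclass(frozen=True)
class SecurityAlert:
    host: str
    rule: str
    raised_at: datetime


def related_alerts(anomaly, alerts, window=timedelta(minutes=15)):
    """Return alerts on the same host raised within +/- window of the anomaly."""
    return [
        alert
        for alert in alerts
        if alert.host == anomaly.host
        and abs(alert.raised_at - anomaly.detected_at) <= window
    ]
```

Matching on host alone keeps the sketch short; a fuller version would also match on user and IP, as the entity scope above suggests.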
Rollout requires careful governance, as observability data volumes are high. Start by applying AI correlation to a few critical business services defined in Splunk ITSI, where performance directly impacts revenue or operations. Use Splunk's Data Stream Processor or ingest-time SPL to filter and tag relevant telemetry before analysis, controlling cost. Establish clear review workflows; AI-generated hypotheses should aid analyst judgment, not replace it. This integration turns Splunk from a tool of record into a system of insight, where security and operations teams share a single, AI-accelerated view of incidents that manifest across both domains.
Key Integration Surfaces in Splunk Cloud Observability
Service Intelligence (ITSI)
Splunk IT Service Intelligence (ITSI) is the primary surface for correlating performance degradation with potential security incidents. AI integration here focuses on the Service Analyzer, Glass Tables, and KPIs.
Key integration points include:
- KPI Base Searches: Inject AI models to analyze metric streams (CPU, memory, latency) for subtle anomalies that may indicate crypto-mining, data exfiltration, or lateral movement. This moves beyond static thresholds to behavioral baselining.
- Episode Review: Use generative AI to automatically summarize multi-KPI episodes, suggesting whether the root cause is likely infrastructure, application, or security-related based on correlated log patterns.
- Predictive Analytics: Leverage the ITSI Machine Learning Toolkit to forecast service degradation, enabling proactive investigation of anomalies before they trigger user-impacting alerts. This creates a feedback loop where predicted performance issues can be cross-referenced with security threat feeds.
Implementation typically involves embedding model inference into SPL searches that feed KPIs, or using the ITSI API to fetch real-time episode data for external AI processing.
High-Value AI Use Cases for Splunk Observability
Integrating AI with Splunk Cloud Observability moves beyond siloed monitoring to detect incidents where security threats manifest as performance anomalies, and vice-versa. These use cases focus on correlating metrics, traces, and logs to provide unified, context-rich insights.
Anomalous Resource Consumption Detection
AI models analyze CPU, memory, and network I/O telemetry across servers and containers to establish behavioral baselines. Deviations—like a sudden, sustained spike in CPU on a non-critical server—trigger automated investigations that cross-correlate with security logs (e.g., process execution, network connections) to identify crypto-mining, data exfiltration, or malware.
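As a rough illustration of behavioral baselining, the sketch below flags points that deviate sharply from a trailing window of recent values. It uses a simple z-score as a stand-in for whatever model you actually deploy; the function name, window size, and threshold are assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=12, threshold=3.0):
    """Return indices whose value deviates from the trailing-window baseline."""
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        # Floor sigma so a perfectly flat baseline does not divide by zero.
        sigma = max(stdev(baseline), 1e-9)
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


cpu = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 22, 21, 95, 21]
print(zscore_anomalies(cpu))  # -> [12]
```

One caveat with a trailing window: a sustained spike quickly becomes part of its own baseline, so production models usually exclude flagged points from the baseline or use seasonal baselines instead.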
Service Degradation Root Cause Triage
When a business service KPI degrades in Splunk ITSI, an AI agent reviews the underlying infrastructure and application trace data. It generates a ranked list of probable root causes (e.g., a specific microservice, database query, or underlying host) and automatically cross-references recent security events (failed logins, firewall denies) on those components to rule out or highlight malicious activity.
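The ranking-plus-cross-reference step might look like the sketch below. It assumes the anomaly scores per component have already been computed elsewhere; `rank_root_causes` and the dictionary keys are hypothetical names used only for illustration.

```python
from collections import Counter


def rank_root_causes(component_scores, security_events):
    """Rank components by anomaly score and attach their security event counts."""
    counts = Counter(event["component"] for event in security_events)
    ranked = sorted(component_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [
        {"component": name, "score": score, "security_events": counts.get(name, 0)}
        for name, score in ranked
    ]
```

A component that ranks high and also carries recent security events is the natural place for an analyst to look first; a high score with zero security events points toward an ordinary operational cause.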
User Experience & Security Correlation
Correlate Real User Monitoring (RUM) data—like increased page load times or JavaScript errors for a specific user cohort—with authentication and access logs. AI identifies if performance issues are isolated to users who authenticated from new geographies or devices, potentially indicating a credential stuffing attack that is creating abnormal session load.
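One simple way to test whether a slowdown is isolated to a cohort is to compare median page-load times between sessions from new geographies and sessions from known ones. The sketch below assumes each session has already been joined with authentication data; the `load_ms` and `new_geo` fields and the 1.5x factor are illustrative assumptions.

```python
from statistics import median


def cohort_slowdown(sessions, factor=1.5):
    """Return True if new-geography sessions are markedly slower than the rest."""
    new = [s["load_ms"] for s in sessions if s["new_geo"]]
    known = [s["load_ms"] for s in sessions if not s["new_geo"]]
    if not new or not known:
        return False  # nothing to compare against
    return median(new) > factor * median(known)
```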
API Latency Anomaly as Attack Signal
Monitor API endpoint latency and error rates from Splunk APM. AI detects unusual patterns, such as specific endpoints slowing down for no clear infrastructure reason, and triggers a security investigation. It automatically queries relevant audit logs for those endpoints to check for brute-force attempts, data scraping, or anomalous payloads that might be causing the degradation.
Container & Orchestration Behavior Baselining
Apply AI to Kubernetes audit logs, pod lifecycle events, and resource metrics ingested into Splunk. The system learns normal scaling, scheduling, and image pull behavior. It then flags anomalies—like a pod suddenly requesting excessive privileges or communicating externally on a non-standard port—which could indicate a compromised container or a supply chain attack.
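A very small version of "learn normal, then flag the unusual" for outbound ports is sketched below. It is a set-membership baseline, not a trained model, and the event fields (`workload`, `dest_port`) are assumed names for data you would extract from your own Kubernetes telemetry.

```python
from collections import defaultdict


def learn_port_baseline(events):
    """Record which destination ports each workload used during the baseline period."""
    baseline = defaultdict(set)
    for event in events:
        baseline[event["workload"]].add(event["dest_port"])
    return baseline


def flag_new_ports(baseline, events):
    """Return events whose destination port was never seen for that workload."""
    return [
        event
        for event in events
        if event["dest_port"] not in baseline.get(event["workload"], set())
    ]
```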
Unified Incident Narrative Generation
When an alert fires from either the observability or security side of Splunk, an AI workflow automatically pulls relevant context from both domains. It generates a unified incident summary that explains, for example, how a database performance issue (observability) coincides with a surge in failed SQL login attempts (security), providing a complete picture for the on-call engineer or SOC analyst.
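The context-gathering step ends in a prompt that presents both domains side by side. The sketch below shows one possible shape for that prompt; `build_incident_prompt` and its wording are assumptions, and the call to the language model itself is deliberately left out.

```python
def build_incident_prompt(observability, security):
    """Assemble a prompt that lists findings from both domains for summarization."""
    lines = [
        "Summarize the likely link between these findings for an on-call engineer.",
        "",
        "Observability findings:",
    ]
    lines += [f"- {item}" for item in observability]
    lines += ["", "Security findings:"]
    lines += [f"- {item}" for item in security]
    lines += ["", "State your confidence and cite the evidence you relied on."]
    return "\n".join(lines)
```

Asking the model to cite its evidence makes the resulting summary easier for an analyst to verify against the raw events.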
Example AI-Driven Workflows
These workflows demonstrate how AI can analyze combined security and performance data in Splunk Cloud to detect sophisticated threats that manifest as operational issues, and vice-versa. Each flow is triggered by telemetry, uses AI to find hidden connections, and drives a concrete action.
Trigger: A Splunk ITSI service health score drops due to sustained high CPU utilization on a subset of application servers.
Context/Data Pulled:
- The AI agent queries Splunk for:
  - Recent metrics data for the affected hosts (CPU, memory, network I/O).
  - Associated process creation and network connection logs from endpoint agents (e.g., CrowdStrike, Tanium logs ingested into Splunk).
  - Outbound network connections to known crypto mining pool IPs/domains from threat intelligence feeds.
Model/Agent Action: A multi-modal analysis is performed:
- Behavioral Correlation: The AI model correlates the spike in CPU usage with the execution of unknown or suspicious processes (e.g., xmrig, cpuminer).
- Network Anomaly Detection: It analyzes outbound traffic patterns, flagging connections to IPs with low reputation scores or on non-standard ports commonly used for mining.
- Confidence Scoring: The agent generates a high-confidence security incident, linking the performance degradation (observability event) to a confirmed security threat (cryptojacking).
System Update/Next Step:
- A high-severity Notable Event is automatically created in Splunk Enterprise Security.
- The incident is enriched with all correlated data: hostnames, process IDs, destination IPs, and the ITSI KPI alert.
- An Adaptive Response action is triggered to isolate the affected endpoints via integrated EDR tools and block the malicious IPs at the network perimeter.
Human Review Point: The SOC analyst reviews the auto-generated incident narrative and evidence. The AI provides a clear link: "Performance issue caused by unauthorized cryptocurrency mining software." The analyst approves the containment actions and initiates a hunt for the initial compromise vector (e.g., a vulnerable web server).
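The confidence-scoring step in this workflow can be approximated by combining the three signals it describes. The weights below are arbitrary placeholders chosen for illustration, not tuned values, and `cryptojacking_score` is a hypothetical helper; the process names come from the example above and the IPs in any test data should come from your own threat intelligence feed.

```python
SUSPICIOUS_PROCESSES = {"xmrig", "cpuminer"}  # miner names mentioned in the workflow


def cryptojacking_score(cpu_anomalous, process_names, dest_ips, mining_pool_ips):
    """Combine CPU, process, and network signals into a 0..1 confidence score."""
    score = 0.0
    if cpu_anomalous:
        score += 0.3  # performance signal (assumed weight)
    if SUSPICIOUS_PROCESSES & {name.lower() for name in process_names}:
        score += 0.4  # known miner binary seen on the host (assumed weight)
    if set(dest_ips) & set(mining_pool_ips):
        score += 0.3  # outbound connection to a listed mining pool (assumed weight)
    return round(score, 2)
```

Keeping the score as a sum of named signals, rather than an opaque model output, makes it easier for the reviewing analyst to see why an incident was rated high.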
Implementation Architecture: Data Flow & Components
A practical blueprint for integrating AI with Splunk Cloud to detect security incidents manifesting as performance anomalies and vice-versa.
The integration architecture connects AI inference to Splunk Cloud's core data pipeline and search head. The primary flow begins with a scheduled Splunk search or a Data Stream Processor (DSP) query that runs across both security (_audit, wineventlog, ids_data) and observability (metrics, apm_traces, infrastructure_logs) data sources. This search identifies correlation patterns—like a spike in CPU metrics from a server host that coincides with outbound network flow data to a known crypto-mining pool. The search results, including key fields (host, user, source, sourcetype, _time), are passed as a structured JSON payload via the Splunk HTTP Event Collector (HEC) or a secure webhook to an external AI service endpoint hosted by Inference Systems.
Our AI service, built on a RAG (Retrieval-Augmented Generation) pipeline, enriches this data. It first queries a vector database (e.g., Pinecone) containing indexed internal knowledge (past incident reports, asset criticality from a CMDB) and relevant external threat intelligence. A large language model then synthesizes the Splunk data and retrieved context to generate a narrative analysis. The output is a concise, plain-language summary that explains the probable link between the performance degradation and the security event, assesses confidence, and suggests immediate investigative steps. This analysis is sent back to Splunk via HEC, creating a new ai_correlation_alert event. A Splunk Adaptive Response Action or a Splunk SOAR (formerly Phantom) playbook can be triggered by this alert to automatically update a ServiceNow incident, tag the relevant host in the Cortex XDR console, or create a high-priority Splunk Enterprise Security Notable Event for the SOC.
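The write-back step can be sketched as a function that wraps the analysis in the HEC event envelope (`time`, `host`, `sourcetype`, `index`, `event`). The `ai_correlation_alert` and `ai_correlations` names come from this article; using the former as the sourcetype, the field names inside `event`, and the `build_hec_event` helper are assumptions.

```python
import time


def build_hec_event(host, summary, confidence, source_search, model_version,
                    event_time=None):
    """Wrap an AI analysis in the JSON envelope expected by HEC."""
    return {
        "time": event_time if event_time is not None else time.time(),
        "host": host,
        "sourcetype": "ai_correlation_alert",
        "index": "ai_correlations",
        "event": {
            "summary": summary,
            "confidence": confidence,
            # Audit trail: what produced this analysis.
            "source_search": source_search,
            "model_version": model_version,
        },
    }
```

The resulting dictionary would be POSTed as JSON to the collector's `/services/collector/event` endpoint with an `Authorization: Splunk <token>` header; the token must be permitted to write to the target index.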
For governance and rollout, we implement this in phases. Phase 1 establishes a read-only integration for a single, high-value use case (e.g., detecting crypto-mining on web servers). All AI-generated outputs are written to a dedicated Splunk index (ai_correlations) with strict RBAC and include an audit trail of the source query and model version. Phase 2 introduces feedback loops, where analyst actions (like closing an incident) are sent back to fine-tune the AI's correlation logic. The entire workflow is containerized using Kubernetes for scalability, with prompts, model parameters, and data schemas managed through a central LLMOps platform (e.g., Weights & Biases) to ensure reproducibility and control.
Code & Payload Examples
Detecting Security Incidents via Performance Metrics
AI models can analyze Splunk Observability Cloud metrics (e.g., CPU, memory, network I/O) to detect subtle anomalies indicative of security events like crypto-mining or data exfiltration. The workflow involves querying the Splunk Observability API for metric time series, scoring them with a pre-trained model, and creating a notable event in Splunk Enterprise Security.
Example Python payload for fetching metrics and scoring:
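```python
import os

import requests

THRESHOLD = 0.8  # illustrative cut-off; tune it for your model

# Fetch CPU utilization metrics from Splunk Observability Cloud.
# The endpoint path, query syntax, and response shape are kept as written in
# this example; confirm them against the current API reference before use.
metrics_response = requests.get(
    'https://api.us1.signalfx.com/v2/timeserieswindow',
    headers={'X-SF-TOKEN': os.environ['SFX_TOKEN']},
    params={
        'query': 'data("cpu.utilization").mean()',
        'start': '-1h',
        'end': 'now',
        'resolution': '1m',
    },
    timeout=30,
)
metrics_response.raise_for_status()
series = metrics_response.json()['data'][0]

# Process data points for anomaly detection.
# `your_ai_model` is a placeholder for your own pre-trained model object.
anomaly_scores = your_ai_model.predict(series['values'])
top_score = max(anomaly_scores)

if top_score > THRESHOLD:
    # Create a security notable event via Splunk's HTTP Event Collector
    hec_payload = {
        "sourcetype": "sfx:anomaly",
        "event": {
            "metric": "cpu.utilization",
            "anomaly_score": top_score,
            "detection_type": "crypto_mining_suspect",
            "host": series['dimensions'].get('host'),
        },
    }
    # Send the payload to HEC. SPLUNK_HEC_URL and SPLUNK_HEC_TOKEN are assumed
    # environment variable names, not part of the original example.
    requests.post(
        os.environ['SPLUNK_HEC_URL'],
        headers={'Authorization': f"Splunk {os.environ['SPLUNK_HEC_TOKEN']}"},
        json=hec_payload,
        timeout=30,
    ).raise_for_status()
```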
Realistic Time Savings & Operational Impact
How AI integration transforms Splunk Cloud Observability workflows by correlating security and performance data, moving from reactive monitoring to proactive detection.
| Metric | Before AI | After AI | Notes |
|---|---|---|---|
| Mean Time to Detect (MTTD) for cross-domain incidents | Hours to days | Minutes to hours | AI correlates performance anomalies (e.g., high CPU) with security logs (e.g., suspicious process) automatically. |
| Manual log correlation for root cause analysis | 1-2 hours per investigation | 10-15 minutes with AI-generated hypotheses | Analyst reviews AI-suggested correlations and evidence, focusing validation. |
| Alert volume from standalone monitoring | High, with separate security & ops alerts | Reduced via intelligent clustering & deduplication | AI groups related metrics and logs into single, contextual incidents. |
| Time to identify crypto-mining or resource abuse | Next-day review of cost reports | Real-time detection during performance spike | AI detects signature-less threats by linking resource patterns to security events. |
| Observability-to-SOC handoff for security incidents | Manual ticket creation & escalation | Automated, enriched incident creation in SOAR/ITSM | AI populates tickets with correlated data from both domains, reducing triage calls. |
| Proactive capacity threat identification | Forecast-based on historical trends only | Anomaly-driven forecasts incorporating security context | AI flags capacity risks tied to potential security events (e.g., DDoS preparation). |
| Compliance reporting for cross-domain controls | Manual data aggregation from separate searches | Automated report generation with AI-curated evidence | AI maps observability data (access logs) to security frameworks (e.g., NIST). |
Governance, Security & Phased Rollout
Integrating AI with Splunk Cloud Observability requires a deliberate approach to data governance, model security, and incremental rollout to ensure reliability and trust.
Governance starts with defining the data access perimeter for your AI models. In a Splunk Cloud Observability context, this means scoping which telemetry streams—such as metrics, traces, logs, and entities from the Observability Cloud—are accessible for AI analysis. Use Splunk's role-based access controls (RBAC) and data collection rules to create a dedicated service account for the integration, limiting its permissions to read-only access on specific indexes or data streams relevant to security-operations correlation (e.g., infrastructure metrics, application traces, and security-relevant logs). This ensures the AI operates within a least-privilege data plane, preventing accidental mutation of operational data and adhering to internal data sovereignty policies.
For security, the integration architecture must treat the AI as a zero-trust workload. This involves:
- Secure API gateways for all calls between your Splunk Cloud instance and the AI inference endpoints, enforcing authentication, encryption, and audit logging.
- Prompt and output validation to sanitize queries and results, preventing data leakage or injection attacks through the AI interface.
- Model behavior monitoring to detect drift or unexpected outputs that could lead to false correlations between performance anomalies and security events. All AI-generated insights or automated actions (like creating a notable event in Splunk Enterprise Security) should be logged back to a dedicated Splunk index for a complete audit trail, enabling traceability from an AI-generated hypothesis back to the raw observability data.
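Prompt and output validation can start small. The sketch below redacts two obvious kinds of sensitive text before it reaches the model and rejects model output that is not well-formed. The regular expressions, the required keys, and both function names are illustrative assumptions and are far from a complete data-loss-prevention policy.

```python
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET_RE = re.compile(r"\b(password|token|secret)=\S+", re.IGNORECASE)
REQUIRED_KEYS = {"summary", "confidence"}


def sanitize_prompt_input(text):
    """Redact e-mail addresses and key=value secrets before prompting a model."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SECRET_RE.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


def validate_model_output(raw):
    """Return the parsed output if it is well-formed, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0 <= confidence <= 1:
        return None
    return data
```

Rejected outputs are worth logging to the audit index as well, since a rising rejection rate is itself a sign of model drift.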
A phased rollout is critical for adoption and risk management. Start with a human-in-the-loop pilot focused on a single, high-value correlation use case, such as detecting crypto-mining activity manifesting as abnormal CPU metrics on a server group. In this phase, the AI analyzes the observability data and surfaces a narrative summary and confidence score to a security analyst in a dedicated Splunk dashboard or via a Slack/Teams alert. The analyst reviews and manually validates the finding before any action is taken. This builds trust and provides labeled data to refine the model. Subsequent phases can introduce semi-automated workflows, where high-confidence AI correlations automatically create low-severity investigative tickets in your ITSM platform (e.g., ServiceNow) or draft notable events in Splunk ES, always requiring analyst approval before escalation. The final phase, controlled automation, would be reserved for well-understood, high-fidelity patterns, enabling the AI to trigger predefined response playbooks in Splunk SOAR for containment, but only within a tightly defined security policy framework and with continuous oversight via the audit logs.
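The three phases can be encoded as an explicit routing policy so that the level of automation is a configuration choice rather than something buried in workflow logic. The sketch below is one possible encoding; the phase names, the 0.9 confidence cut-off, and the action labels are assumptions.

```python
from enum import Enum


class Phase(Enum):
    PILOT = 1                   # human-in-the-loop only
    SEMI_AUTOMATED = 2          # drafts tickets that still need approval
    CONTROLLED_AUTOMATION = 3   # playbooks for approved patterns only


def next_action(phase, confidence, pattern,
                approved_patterns=frozenset(), high_confidence=0.9):
    """Decide what happens to an AI correlation under the current rollout phase."""
    if phase is Phase.PILOT or confidence < high_confidence:
        return "analyst_review"
    if phase is Phase.CONTROLLED_AUTOMATION and pattern in approved_patterns:
        return "run_playbook"
    return "draft_ticket_pending_approval"
```

Because low-confidence findings always fall back to analyst review, moving to a later phase widens what high-confidence findings may do without removing the human check on everything else.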
Frequently Asked Questions
Practical questions about using AI to correlate security and observability data in Splunk Cloud, enabling detection of cross-domain incidents like crypto-mining on a server or performance degradation caused by a security event.
The integration typically works by creating a unified analysis layer that queries both security-relevant indexes (like audit, firewall, endpoint) and observability indexes (like metrics, apm, infrastructure) within the same Splunk Cloud deployment.
- Trigger: A scheduled search or real-time alert from either domain (e.g., a high CPU alert from a metrics index, or a suspicious process creation from a security source).
- Context Pull: The AI agent or workflow executes a correlated search across both data domains. For a CPU alert, it would also query for recent logins, network connections, and process executions on that host from security sources.
- Model Action: A model (LLM or classifier) analyzes the combined dataset to identify if the observability anomaly has a security root cause, or vice-versa. It generates a narrative summary (e.g., "High CPU on server-web-01 correlates with execution of xmrig process and outbound connections to a known crypto-mining pool IP").
- System Update: The finding is written back to Splunk as a Notable Event in Enterprise Security, or creates an incident in a connected ITSM tool like ServiceNow, enriched with the cross-domain context.
- Human Review: The event is routed to a combined SecOps/CloudOps team or a designated fusion analyst for validation and response.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.