Inferensys

AI Integration for Splunk Cloud Observability

Connect AI models to Splunk Cloud Observability to detect security incidents disguised as performance issues—like crypto-mining on a server—by correlating metrics, logs, and traces across domains.
FROM CORRELATION TO CAUSALITY

Where AI Fits in Splunk Cloud Observability

Integrate AI with Splunk Cloud Observability to move beyond siloed dashboards and connect security incidents to their root causes in infrastructure and application performance.

AI integration for Splunk Cloud Observability focuses on the metrics, traces, and logs ingested into the platform, particularly from sources like APM tools (e.g., Dynatrace, New Relic), infrastructure monitors, and custom application telemetry. The goal is to correlate anomalies in performance KPIs—such as sudden CPU spikes, elevated error rates, or increased API latency—with concurrent security events from your SIEM data. This creates a unified investigative surface where a crypto-mining alert on a virtual machine can be instantly linked to the anomalous process and network telemetry that triggered it, or where a DDoS attack's impact on service availability is quantified in real-time business terms.

Implementation typically involves deploying AI models that analyze time-series data within Splunk's Metric Store and ITSI service models. These models establish behavioral baselines for normal performance and flag deviations. When a significant anomaly is detected, an automated workflow can:

  1. Query the Splunk Enterprise Security correlation engine for related security alerts within the same time window and entity scope (host, user, IP).
  2. Use a large language model to synthesize a narrative that explains the potential link, for example: 'The 400% increase in outbound network traffic from app-server-05 coincides with a malware detection alert and anomalous spawning of powershell.exe processes.'
  3. Enrich the resulting incident in Splunk Mission Control or a connected ITSM tool like ServiceNow with this cross-domain context, priority, and suggested diagnostic SPL searches.
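
As a rough illustration of what "establishing a behavioral baseline and flagging deviations" can mean in practice, the sketch below uses a trailing-window z-score. This is a deliberately simple stand-in, not the model the integration actually ships; the function name, window size, and threshold are assumptions chosen for the example.

```python
from statistics import mean, stdev

def flag_deviations(values, window=30, z_threshold=3.0):
    """Return indices whose value deviates from the trailing-window baseline."""
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all counts as a deviation.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

# Hypothetical CPU utilization series: steady around 20%, then a sharp jump.
cpu = [20 + (i % 3) for i in range(40)] + [95, 96, 97]
print(flag_deviations(cpu))
```

A production baseline would usually account for seasonality (time of day, day of week), which a plain trailing window does not.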

Rollout requires careful governance, as observability data volumes are high. Start by applying AI correlation to a few critical business services defined in Splunk ITSI, where performance directly impacts revenue or operations. Use Splunk's Data Stream Processor or ingest-time SPL to filter and tag relevant telemetry before analysis, controlling cost. Establish clear review workflows; AI-generated hypotheses should aid analyst judgment, not replace it. This integration turns Splunk from a tool of record into a system of insight, where security and operations teams share a single, AI-accelerated view of incidents that manifest across both domains.

PLATFORM SURFACES

Key Integration Surfaces in Splunk Cloud Observability

Service Intelligence (ITSI)

Splunk IT Service Intelligence (ITSI) is the primary surface for correlating performance degradation with potential security incidents. AI integration here focuses on the Service Analyzer, Glass Tables, and KPIs.

Key integration points include:

  • KPI Base Searches: Inject AI models to analyze metric streams (CPU, memory, latency) for subtle anomalies that may indicate crypto-mining, data exfiltration, or lateral movement. This moves beyond static thresholds to behavioral baselining.
  • Episode Review: Use generative AI to automatically summarize multi-KPI episodes, suggesting whether the root cause is likely infrastructure, application, or security-related based on correlated log patterns.
  • Predictive Analytics: Leverage ITSI Predictive Analytics, which builds on the Splunk Machine Learning Toolkit (MLTK), to forecast service degradation, enabling proactive investigation of anomalies before they trigger user-impacting alerts. This creates a feedback loop where predicted performance issues can be cross-referenced with security threat feeds.

Implementation typically involves embedding model inference into SPL searches that feed KPIs, or using the ITSI API to fetch real-time episode data for external AI processing.

CORRELATING SECURITY AND PERFORMANCE DATA

High-Value AI Use Cases for Splunk Observability

Integrating AI with Splunk Cloud Observability moves beyond siloed monitoring to detect incidents where security threats manifest as performance anomalies, and vice-versa. These use cases focus on correlating metrics, traces, and logs to provide unified, context-rich insights.

01

Anomalous Resource Consumption Detection

AI models analyze CPU, memory, and network I/O telemetry across servers and containers to establish behavioral baselines. Deviations—like a sudden, sustained spike in CPU on a non-critical server—trigger automated investigations that cross-correlate with security logs (e.g., process execution, network connections) to identify crypto-mining, data exfiltration, or malware.

Detection mode: batch -> real-time
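
The cross-correlation step in this use case can be sketched as a join on host and time window. The event dictionaries and field names below are hypothetical; in a real deployment they would come from Splunk search results.

```python
from datetime import datetime, timedelta

def correlate(anomaly, security_events, window_minutes=15):
    """Return security events on the anomaly's host within +/- the window."""
    window = timedelta(minutes=window_minutes)
    return [
        e for e in security_events
        if e["host"] == anomaly["host"]
        and abs(e["time"] - anomaly["time"]) <= window
    ]

# Hypothetical anomaly and security log records.
anomaly = {"host": "app-server-05", "time": datetime(2024, 1, 1, 12, 0)}
security_events = [
    {"host": "app-server-05", "time": datetime(2024, 1, 1, 12, 5), "detail": "xmrig"},
    {"host": "app-server-05", "time": datetime(2024, 1, 1, 9, 0), "detail": "admin login"},
    {"host": "db-01", "time": datetime(2024, 1, 1, 12, 1), "detail": "backup job"},
]
matches = correlate(anomaly, security_events)
print([m["detail"] for m in matches])
```

Only the process start on the same host and inside the window survives the filter; the earlier login and the event on another host are dropped.
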
02

Service Degradation Root Cause Triage

When a business service KPI degrades in Splunk ITSI, an AI agent reviews the underlying infrastructure and application trace data. It generates a ranked list of probable root causes (e.g., a specific microservice, database query, or underlying host) and automatically cross-references recent security events (failed logins, firewall denies) on those components to rule out or highlight malicious activity.

MTTR impact: hours -> minutes
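
A minimal sketch of the ranking idea, assuming an upstream step has already produced an anomaly score per component and a count of recent security events (failed logins, firewall denies) per component. All names and numbers are illustrative.

```python
def rank_root_causes(candidates, security_counts):
    """Order components by anomaly score and attach security event counts."""
    ranked = sorted(candidates, key=lambda c: c["anomaly_score"], reverse=True)
    results = []
    for candidate in ranked:
        count = security_counts.get(candidate["component"], 0)
        results.append({
            **candidate,
            "security_events": count,
            "review_for_malicious_activity": count > 0,
        })
    return results

# Hypothetical inputs for one degraded business service.
candidates = [
    {"component": "checkout-service", "anomaly_score": 0.62},
    {"component": "orders-db", "anomaly_score": 0.91},
    {"component": "host-web-03", "anomaly_score": 0.40},
]
security_counts = {"orders-db": 57}
for row in rank_root_causes(candidates, security_counts):
    print(row)
```
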
03

User Experience & Security Correlation

Correlate Real User Monitoring (RUM) data—like increased page load times or JavaScript errors for a specific user cohort—with authentication and access logs. AI identifies if performance issues are isolated to users who authenticated from new geographies or devices, potentially indicating a credential stuffing attack that is creating abnormal session load.
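
One simple way to test whether a slowdown is isolated to a cohort is to compare median load times between sessions flagged as new geography or device and all other sessions. The sketch below assumes the flag has already been derived from authentication logs; the field names are hypothetical.

```python
from statistics import median

def cohort_slowdown(sessions):
    """Ratio of median page load time: flagged cohort vs. everyone else."""
    flagged = [s["load_ms"] for s in sessions if s["new_geo_or_device"]]
    known = [s["load_ms"] for s in sessions if not s["new_geo_or_device"]]
    if not flagged or not known:
        return None  # not enough data to compare cohorts
    return median(flagged) / median(known)

# Hypothetical RUM sessions joined with authentication context.
sessions = [
    {"user": "a", "load_ms": 900, "new_geo_or_device": False},
    {"user": "b", "load_ms": 1100, "new_geo_or_device": False},
    {"user": "c", "load_ms": 4200, "new_geo_or_device": True},
    {"user": "d", "load_ms": 3800, "new_geo_or_device": True},
]
ratio = cohort_slowdown(sessions)
print(ratio)
```

A ratio well above 1 suggests the degradation tracks the flagged cohort and is worth a security look; a ratio near 1 points back to a general performance problem.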

04

API Latency Anomaly as Attack Signal

Monitor API endpoint latency and error rates from Splunk APM. AI detects unusual patterns, such as specific endpoints slowing down for no clear infrastructure reason, and triggers a security investigation. It automatically queries relevant audit logs for those endpoints to check for brute-force attempts, data scraping, or anomalous payloads that might be causing the degradation.

Proactive signal: before full breach
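
"Slowing down for no clear infrastructure reason" can be approximated by flagging endpoints whose latency rose sharply while host CPU stayed in a normal range. The thresholds and field names below are assumptions for illustration only.

```python
def unexplained_slow_endpoints(endpoints, latency_ratio=2.0, cpu_ceiling=70.0):
    """Return endpoints that slowed sharply while their host CPU stayed normal."""
    return [
        e["path"] for e in endpoints
        if e["p95_ms"] >= latency_ratio * e["baseline_p95_ms"]
        and e["host_cpu_pct"] < cpu_ceiling
    ]

# Hypothetical per-endpoint stats derived from APM data.
endpoints = [
    {"path": "/api/login", "p95_ms": 1800, "baseline_p95_ms": 300, "host_cpu_pct": 35},
    {"path": "/api/report", "p95_ms": 2500, "baseline_p95_ms": 900, "host_cpu_pct": 92},
    {"path": "/api/health", "p95_ms": 40, "baseline_p95_ms": 35, "host_cpu_pct": 35},
]
suspects = unexplained_slow_endpoints(endpoints)
print(suspects)
```

Here `/api/report` is slow but has an obvious infrastructure explanation (high CPU), so only `/api/login` is handed to the security investigation.
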
05

Container & Orchestration Behavior Baselining

Apply AI to Kubernetes audit logs, pod lifecycle events, and resource metrics ingested into Splunk. The system learns normal scaling, scheduling, and image pull behavior. It then flags anomalies—like a pod suddenly requesting excessive privileges or communicating externally on a non-standard port—which could indicate a compromised container or a supply chain attack.
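
The two example anomalies above can be expressed as simple checks against a pod record. In this sketch, `spec.containers[].securityContext.privileged` follows the Kubernetes pod spec, while `observed_egress` and the baseline port set are hypothetical fields standing in for learned network behavior.

```python
def pod_findings(pod, baseline_ports):
    """List privilege and egress findings for one pod record."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged"):
            findings.append(f"{container['name']}: privileged container")
    # observed_egress is a hypothetical field populated from network telemetry.
    for conn in pod.get("observed_egress", []):
        if conn["port"] not in baseline_ports:
            findings.append(f"egress to {conn['dest']}:{conn['port']} outside baseline")
    return findings

pod = {
    "spec": {"containers": [
        {"name": "web", "securityContext": {"privileged": False}},
        {"name": "worker", "securityContext": {"privileged": True}},
    ]},
    "observed_egress": [
        {"dest": "10.0.0.12", "port": 443},
        {"dest": "203.0.113.7", "port": 3333},
    ],
}
findings = pod_findings(pod, baseline_ports={53, 443})
print(findings)
```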

06

Unified Incident Narrative Generation

When an alert fires from either the observability or security side of Splunk, an AI workflow automatically pulls relevant context from both domains. It generates a unified incident summary that explains, for example, how a database performance issue (observability) coincides with a surge in failed SQL login attempts (security), providing a complete picture for the on-call engineer or SOC analyst.

Context assembly: same day
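
Before any language model is involved, the context assembly itself is a plain data-shaping step. The sketch below builds a text summary from one observability alert and a list of security events; the structure and field names are assumptions, and an LLM could be given this text as input rather than replaced by it.

```python
def incident_summary(observability_alert, security_events):
    """Assemble a plain-text summary covering both domains."""
    lines = [
        f"Entity: {observability_alert['entity']}",
        f"Observability: {observability_alert['description']}",
    ]
    if security_events:
        lines.append("Security events in the same window:")
        lines.extend(
            f"  - {e['description']} (x{e['count']})" for e in security_events
        )
    else:
        lines.append("Security: no related events found in the same window.")
    return "\n".join(lines)

# Hypothetical inputs mirroring the database example above.
alert = {"entity": "orders-db", "description": "query latency 5x above baseline"}
events = [{"description": "failed SQL login", "count": 214}]
print(incident_summary(alert, events))
```
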
CORRELATING SECURITY AND OBSERVABILITY DATA

Example AI-Driven Workflows

These workflows demonstrate how AI can analyze combined security and performance data in Splunk Cloud to detect sophisticated threats that manifest as operational issues, and vice-versa. Each flow is triggered by telemetry, uses AI to find hidden connections, and drives a concrete action.

Trigger: A Splunk ITSI service health score drops due to sustained high CPU utilization on a subset of application servers.

Context/Data Pulled:

  • The AI agent queries Splunk for:
    • Recent metrics data for the affected hosts (CPU, memory, network I/O).
    • Associated process creation and network connection logs from endpoint agents (e.g., CrowdStrike, Tanium logs ingested into Splunk).
    • Outbound network connections to known crypto mining pool IPs/domains from threat intelligence feeds.

Model/Agent Action: A multi-step analysis is performed:

  1. Behavioral Correlation: The AI model correlates the spike in CPU usage with the execution of unknown or suspicious processes (e.g., xmrig, cpuminer).
  2. Network Anomaly Detection: It analyzes outbound traffic patterns, flagging connections to IPs with low reputation scores or on non-standard ports commonly used for mining.
  3. Confidence Scoring: The agent scores the combined evidence and, when confidence is high, raises a security incident that links the performance degradation (observability event) to suspected cryptojacking, pending analyst confirmation.

System Update/Next Step:

  • A high-severity Notable Event is automatically created in Splunk Enterprise Security.
  • The incident is enriched with all correlated data: hostnames, process IDs, destination IPs, and the ITSI KPI alert.
  • An Adaptive Response action is prepared to isolate the affected endpoints via integrated EDR tools and block the malicious IPs at the network perimeter; it runs once the analyst approves containment.

Human Review Point: The SOC analyst reviews the auto-generated incident narrative and evidence. The AI provides a clear link: "Performance issue caused by unauthorized cryptocurrency mining software." The analyst approves the containment actions and initiates a hunt for the initial compromise vector (e.g., a vulnerable web server).
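
The confidence scoring and human review gate in this workflow can be sketched as follows. The weights, threshold, and evidence fields are illustrative assumptions, not tuned values; the process names come from the workflow description above.

```python
KNOWN_MINER_PROCESSES = {"xmrig", "cpuminer"}

def cryptojacking_confidence(evidence):
    """Score evidence on a 0-100 scale using fixed, illustrative weights."""
    score = 0
    if evidence["cpu_spike"]:
        score += 30
    if KNOWN_MINER_PROCESSES & set(evidence["processes"]):
        score += 40
    if evidence["threat_intel_ip_match"]:
        score += 30
    return score

def containment_allowed(confidence, analyst_approved, threshold=70):
    """Containment runs only for high-confidence findings an analyst approved."""
    return confidence >= threshold and analyst_approved

evidence = {
    "cpu_spike": True,
    "processes": ["nginx", "xmrig"],
    "threat_intel_ip_match": True,
}
confidence = cryptojacking_confidence(evidence)
print(confidence, containment_allowed(confidence, analyst_approved=False))
```

Keeping the approval flag as a separate input makes the human review point explicit: a high score alone never triggers isolation.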

CORRELATING SECURITY AND OBSERVABILITY DATA

Implementation Architecture: Data Flow & Components

A practical blueprint for integrating AI with Splunk Cloud to detect security incidents manifesting as performance anomalies and vice-versa.

The integration architecture connects AI inference to Splunk Cloud's core data pipeline and search head. The primary flow begins with a scheduled Splunk search or a Data Stream Processor (DSP) query that runs across both security (_audit, wineventlog, ids_data) and observability (metrics, apm_traces, infrastructure_logs) data sources. This search identifies correlation patterns, such as a spike in CPU metrics from a server host that coincides with outbound network flow data to a known crypto-mining pool. The search results, including key fields (host, user, source, sourcetype, _time), are passed as a structured JSON payload through a webhook or custom alert action to an external AI service endpoint hosted by Inference Systems. The Splunk HTTP Event Collector (HEC) is an ingestion endpoint, so it serves the return path into Splunk rather than the outbound call.
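
The structured JSON payload might look like the sketch below, which keeps only the key fields named above from each search result. The envelope keys (`search_name`, `results`) are assumptions for illustration; the receiving AI service would define the actual schema.

```python
import json

# Key fields named in the architecture description.
CONTEXT_FIELDS = ("host", "user", "source", "sourcetype", "_time")

def build_payload(search_name, results):
    """Serialize search results, keeping only the agreed context fields."""
    return json.dumps({
        "search_name": search_name,
        "results": [
            {field: row.get(field) for field in CONTEXT_FIELDS} for row in results
        ],
    })

# Hypothetical search result row with an extra field that gets dropped.
rows = [{
    "host": "app-server-05",
    "source": "netflow",
    "sourcetype": "ids_data",
    "_time": "2024-01-01T12:00:00Z",
    "raw_bytes": 123456,
}]
payload = build_payload("cpu_spike_with_mining_pool_traffic", rows)
print(payload)
```

Restricting the payload to an explicit field list also limits how much raw telemetry leaves Splunk.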

Our AI service, built on a RAG (Retrieval-Augmented Generation) pipeline, enriches this data. It first queries a vector database (e.g., Pinecone) containing indexed internal knowledge (past incident reports, asset criticality from a CMDB) and relevant external threat intelligence. A large language model then synthesizes the Splunk data and retrieved context to generate a narrative analysis. This output is a concise, plain-language summary explaining the probable link between the performance degradation and the security event, assessing confidence, and suggesting immediate investigative steps. This analysis is sent back to Splunk via HEC, creating a new ai_correlation_alert event. A Splunk Adaptive Response Action or a Splunk SOAR (formerly Phantom) playbook can be triggered by this alert to automatically update a ServiceNow incident, tag the relevant host in the Cortex XDR console, or create a high-priority Splunk Enterprise Security Notable Event for the SOC.
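
To make the retrieve-then-generate flow concrete, here is a toy sketch that uses an in-memory bag-of-words similarity in place of a real embedding model and vector database. It does not use the Pinecone API, and the prompt wording is an assumption; it only shows how retrieved context and the Splunk finding end up in one prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(finding, documents):
    context = "\n".join(f"- {d}" for d in retrieve(finding, documents))
    return (
        f"Finding:\n{finding}\n\nRelevant context:\n{context}\n\n"
        "Explain the probable link and state your confidence."
    )

# Hypothetical knowledge base entries.
docs = [
    "past incident: xmrig miner found on web server",
    "asset db-01 is a critical payments database",
    "holiday schedule for the office",
]
print(build_prompt("high cpu on web server with xmrig process", docs))
```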

For governance and rollout, we implement this in phases. Phase 1 establishes a read-only integration for a single, high-value use case (e.g., detecting crypto-mining on web servers). All AI-generated outputs are written to a dedicated Splunk index (ai_correlations) with strict RBAC and include an audit trail of the source query and model version. Phase 2 introduces feedback loops, where analyst actions (like closing an incident) are sent back to fine-tune the AI's correlation logic. The entire workflow is containerized using Kubernetes for scalability, with prompts, model parameters, and data schemas managed through a central LLMOps platform (e.g., Weights & Biases) to ensure reproducibility and control.
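
An audit record for the dedicated index could be shaped like the sketch below. The outer keys (`time`, `index`, `sourcetype`, `event`) follow the HEC event format; using `ai_correlation_alert` as the sourcetype and hashing the source query are assumptions added for illustration.

```python
import hashlib
import time

def audit_event(source_query, model_version, analysis, index="ai_correlations"):
    """Build a HEC-style event that records what produced an AI finding."""
    return {
        "time": time.time(),
        "index": index,
        "sourcetype": "ai_correlation_alert",  # assumed sourcetype name
        "event": {
            "analysis": analysis,
            "model_version": model_version,
            "source_query": source_query,
            # A hash makes it easy to group findings produced by the same query.
            "source_query_sha256": hashlib.sha256(source_query.encode()).hexdigest(),
        },
    }

record = audit_event(
    source_query="index=metrics host=web-* | stats avg(cpu) by host",
    model_version="2024-01-r1",  # hypothetical version label
    analysis="CPU spike on web-03 coincides with outbound mining pool traffic.",
)
print(record["index"], record["event"]["model_version"])
```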

SPLUNK CLOUD OBSERVABILITY

Code & Payload Examples

Detecting Security Incidents via Performance Metrics

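AI models can analyze Splunk Observability Cloud metrics (e.g., CPU, memory, network I/O) to detect subtle anomalies indicative of security events like crypto-mining or data exfiltration. The workflow involves querying the Splunk Observability API for metric time series, scoring them with a pre-trained model, and sending each finding to Splunk through the HTTP Event Collector, where a correlation search can raise a notable event in Splunk Enterprise Security.

Example Python sketch for fetching metrics, scoring them, and posting a finding. Check the endpoint path, query syntax, and response shape against the current Splunk Observability Cloud API reference for your realm: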

python
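import os
import time

import requests

REALM = "us1"      # assumption: your Splunk Observability Cloud realm
THRESHOLD = 0.9    # assumption: tune to your model's score range

# Fetch the last hour of CPU utilization at 1-minute resolution.
# /v1/timeserieswindow takes epoch-millisecond bounds and a metadata-style query.
end_ms = int(time.time() * 1000)
response = requests.get(
    f"https://api.{REALM}.signalfx.com/v1/timeserieswindow",
    headers={"X-SF-TOKEN": os.environ["SFX_TOKEN"]},
    params={"query": 'sf_metric:"cpu.utilization"', "startMs": end_ms - 3_600_000,
            "endMs": end_ms, "resolution": 60_000},
    timeout=30,
)
response.raise_for_status()

# The response maps each time series ID to a list of [timestamp_ms, value] pairs.
for tsid, points in response.json().get("data", {}).items():
    values = [value for _, value in points]
    if not values:
        continue
    # your_ai_model is a placeholder for a pre-trained anomaly scorer.
    anomaly_scores = your_ai_model.predict(values)
    if max(anomaly_scores) > THRESHOLD:
        # Send the finding to Splunk through the HTTP Event Collector (HEC).
        hec_payload = {
            "sourcetype": "sfx:anomaly",
            "event": {
                "metric": "cpu.utilization",
                "anomaly_score": float(max(anomaly_scores)),
                "detection_type": "crypto_mining_suspect",
                "tsid": tsid,  # look up dimensions such as host via the metadata API
            },
        }
        # SPLUNK_HEC_URL and SPLUNK_HEC_TOKEN are assumed environment variables;
        # the URL should end in /services/collector/event.
        requests.post(
            os.environ["SPLUNK_HEC_URL"],
            headers={"Authorization": f"Splunk {os.environ['SPLUNK_HEC_TOKEN']}"},
            json=hec_payload,
            timeout=30,
        ).raise_for_status()
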
AI-ENHANCED OBSERVABILITY

Realistic Time Savings & Operational Impact

How AI integration transforms Splunk Cloud Observability workflows by correlating security and performance data, moving from reactive monitoring to proactive detection.

| Metric | Before AI | After AI | Notes |
| --- | --- | --- | --- |
| Mean Time to Detect (MTTD) for cross-domain incidents | Hours to days | Minutes to hours | AI correlates performance anomalies (e.g., high CPU) with security logs (e.g., suspicious process) automatically. |
| Manual log correlation for root cause analysis | 1-2 hours per investigation | 10-15 minutes with AI-generated hypotheses | Analyst reviews AI-suggested correlations and evidence, focusing validation. |
| Alert volume from standalone monitoring | High, with separate security & ops alerts | Reduced via intelligent clustering & deduplication | AI groups related metrics and logs into single, contextual incidents. |
| Time to identify crypto-mining or resource abuse | Next-day review of cost reports | Real-time detection during performance spike | AI detects signature-less threats by linking resource patterns to security events. |
| Observability-to-SOC handoff for security incidents | Manual ticket creation & escalation | Automated, enriched incident creation in SOAR/ITSM | AI populates tickets with correlated data from both domains, reducing triage calls. |
| Proactive capacity threat identification | Forecast based on historical trends only | Anomaly-driven forecasts incorporating security context | AI flags capacity risks tied to potential security events (e.g., DDoS preparation). |
| Compliance reporting for cross-domain controls | Manual data aggregation from separate searches | Automated report generation with AI-curated evidence | AI maps observability data (access logs) to security frameworks (e.g., NIST). |

ARCHITECTING A CONTROLLED DEPLOYMENT

Governance, Security & Phased Rollout

Integrating AI with Splunk Cloud Observability requires a deliberate approach to data governance, model security, and incremental rollout to ensure reliability and trust.

Governance starts with defining the data access perimeter for your AI models. In a Splunk Cloud Observability context, this means scoping which telemetry streams—such as metrics, traces, logs, and entities from the Observability Cloud—are accessible for AI analysis. Use Splunk's role-based access controls (RBAC) and data collection rules to create a dedicated service account for the integration, limiting its permissions to read-only access on specific indexes or data streams relevant to security-operations correlation (e.g., infrastructure metrics, application traces, and security-relevant logs). This ensures the AI operates within a least-privilege data plane, preventing accidental mutation of operational data and adhering to internal data sovereignty policies.

For security, the integration architecture must treat the AI as a zero-trust workload. This involves:

  • Secure API gateways for all calls between your Splunk Cloud instance and the AI inference endpoints, enforcing authentication, encryption, and audit logging.
  • Prompt and output validation to sanitize queries and results, preventing data leakage or injection attacks through the AI interface.
  • Model behavior monitoring to detect drift or unexpected outputs that could lead to false correlations between performance anomalies and security events.

All AI-generated insights or automated actions (like creating a notable event in Splunk Enterprise Security) should be logged back to a dedicated Splunk index for a complete audit trail, enabling traceability from an AI-generated hypothesis back to the raw observability data.
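
The prompt and output validation control listed above could start as simply as the sketch below: redact obvious secrets from text and reject model output that does not match an expected shape. The allowed labels, field names, and regular expression are assumptions for illustration and would need hardening for production use.

```python
import json
import re

ALLOWED_CAUSES = {"infrastructure", "application", "security"}
SECRET_PATTERN = re.compile(r"(?i)(password|token|api[_-]?key)\s*[=:]\s*\S+")

def redact(text):
    """Mask values that follow common secret-looking keys."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)

def validate_output(raw):
    """Parse model output and enforce an expected schema before it is used."""
    data = json.loads(raw)
    if data.get("probable_cause") not in ALLOWED_CAUSES:
        raise ValueError("unexpected probable_cause")
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        raise ValueError("confidence must be a number between 0 and 1")
    data["summary"] = redact(str(data.get("summary", "")))
    return data

raw = '{"probable_cause": "security", "confidence": 0.82, "summary": "login with password=hunter2 failed"}'
print(validate_output(raw)["summary"])
```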

A phased rollout is critical for adoption and risk management. Start with a human-in-the-loop pilot focused on a single, high-value correlation use case, such as detecting crypto-mining activity manifesting as abnormal CPU metrics on a server group. In this phase, the AI analyzes the observability data and surfaces a narrative summary and confidence score to a security analyst in a dedicated Splunk dashboard or via a Slack/Teams alert. The analyst reviews and manually validates the finding before any action is taken. This builds trust and provides labeled data to refine the model. Subsequent phases can introduce semi-automated workflows, where high-confidence AI correlations automatically create low-severity investigative tickets in your ITSM platform (e.g., ServiceNow) or draft notable events in Splunk ES, always requiring analyst approval before escalation. The final phase, controlled automation, would be reserved for well-understood, high-fidelity patterns, enabling the AI to trigger predefined response playbooks in Splunk SOAR for containment, but only within a tightly defined security policy framework and with continuous oversight via the audit logs.
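
The three phases can be encoded as an explicit policy so that the permitted action is decided in one place. The phase names follow the paragraph above; the confidence threshold, action labels, and the `pattern_is_vetted` flag are hypothetical.

```python
from enum import Enum

class Phase(Enum):
    PILOT = 1
    SEMI_AUTOMATED = 2
    CONTROLLED_AUTOMATION = 3

def next_action(phase, confidence, pattern_is_vetted=False, high=0.9):
    """Decide the most automated action the current rollout phase permits."""
    if phase is Phase.PILOT or confidence < high:
        return "surface_to_analyst"
    if phase is Phase.SEMI_AUTOMATED or not pattern_is_vetted:
        return "create_low_severity_ticket"
    return "trigger_soar_playbook"

print(next_action(Phase.PILOT, 0.99))
print(next_action(Phase.SEMI_AUTOMATED, 0.95))
print(next_action(Phase.CONTROLLED_AUTOMATION, 0.95, pattern_is_vetted=True))
```

Low-confidence findings always fall back to analyst review, regardless of phase, which keeps the most automated path reserved for vetted patterns.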

AI INTEGRATION FOR SPLUNK CLOUD OBSERVABILITY

Frequently Asked Questions

Practical questions about using AI to correlate security and observability data in Splunk Cloud, enabling detection of cross-domain incidents like crypto-mining on a server or performance degradation caused by a security event.

The integration typically works by creating a unified analysis layer that queries both security-relevant indexes (like audit, firewall, endpoint) and observability indexes (like metrics, apm, infrastructure) within the same Splunk Cloud deployment.

  1. Trigger: A scheduled search or real-time alert from either domain (e.g., a high CPU alert from a metrics index, or a suspicious process creation from a security source).
  2. Context Pull: The AI agent or workflow executes a correlated search across both data domains. For a CPU alert, it would also query for recent logins, network connections, and process executions on that host from security sources.
  3. Model Action: A model (LLM or classifier) analyzes the combined dataset to identify if the observability anomaly has a security root cause, or vice-versa. It generates a narrative summary (e.g., "High CPU on server-web-01 correlates with execution of xmrig process and outbound connections to a known crypto-mining pool IP").
  4. System Update: The finding is written back to Splunk as a Notable Event in Enterprise Security, or creates an incident in a connected ITSM tool like ServiceNow, enriched with the cross-domain context.
  5. Human Review: The event is routed to a combined SecOps/CloudOps team or a designated fusion analyst for validation and response.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.