AI Integration for Palo Alto Cortex Data Lake

Where AI Fits into Cortex Data Lake
Integrating AI with Palo Alto Networks Cortex Data Lake (CDL) transforms your long-term log repository into an active threat hunting and model training asset.
The Cortex Data Lake API serves as the primary integration point, allowing AI systems to query petabytes of normalized network, endpoint, and threat logs without impacting the real-time processing of Cortex XDR or XSIAM. Key data objects for AI analysis include firewall_traffic_logs, threat_logs, url_logging_logs, and correlation_logs. By treating CDL as a historical data fabric, AI models can perform longitudinal analysis: detecting low-and-slow attack patterns, establishing sophisticated behavioral baselines across quarters, and identifying policy drift or shadow IT that evades real-time detection.
Implementation involves deploying a dedicated AI inference layer that polls the CDL API using time-range and filtered queries. This layer executes two core workflows: 1) Bulk Threat Hunting, where models analyze months of data to uncover advanced persistent threats (APTs) or lateral movement patterns, generating hypotheses and exporting relevant log subsets for analyst review; and 2) Synthetic Training Data Generation, where the AI anonymizes and structures real CDL logs to create high-fidelity, labeled datasets for training custom detection models specific to your environment. This moves beyond pre-packaged analytics to models that understand your unique network topology and business context.
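The time-range polling described above can be sketched as follows. This is a minimal illustration, not the CDL client itself: the helper names, the `time_generated` and `action` field names, and the SQL-style query text are assumptions, so check them against the query interface and schema you actually use.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple

Window = Tuple[datetime, datetime]


def time_windows(start: datetime, end: datetime, step: timedelta) -> Iterator[Window]:
    """Yield consecutive [window_start, window_end) pairs covering start..end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def build_query(table: str, window: Window, action: str) -> str:
    """Render one filtered query per window.

    Hypothetical syntax and field names; inputs here are trusted constants,
    so plain string formatting is acceptable for the sketch.
    """
    lo, hi = window
    return (
        f"SELECT * FROM {table} "
        f"WHERE time_generated >= '{lo.isoformat()}' "
        f"AND time_generated < '{hi.isoformat()}' "
        f"AND action = '{action}'"
    )


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 8, tzinfo=timezone.utc)
queries = [
    build_query("firewall_traffic_logs", w, "deny")
    for w in time_windows(start, end, timedelta(days=2))
]
```

Splitting a long range into small windows keeps each request bounded, which makes retries and rate limiting easier to reason about than one multi-month query.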
Governance is critical. Rollout should start with read-only API access and sandboxed log subsets. Implement strict data minimization and anonymization protocols for training data workflows. AI-generated hunting leads must feed back into Cortex XDR as XQL queries or external alerts for validation and action within the primary SOC workflow, creating a closed-loop system. This architecture ensures AI augments—rather than bypasses—existing security processes and audit trails.
For related architectural patterns on using AI with security data lakes, see our guide on AI Integration for Splunk Data Fabric Search and considerations for AI Governance and LLMOps Platforms.
Key Integration Surfaces in Cortex Data Lake
Querying Historical Data for Threat Hunting
Cortex Data Lake (CDL) serves as the central, scalable repository for logs from Palo Alto Networks firewalls (Strata), Prisma Access, and Cloud-Delivered Security Services. AI integration targets the vast historical data here—often petabytes—for retrospective threat hunting and trend analysis without impacting real-time processing.
Key surfaces include:
- Log Types: Traffic, Threat, URL, WildFire, Data Filtering, and Tunnel Inspection logs.
- Query APIs: The loggingservice API and Query Service API enable bulk retrieval of logs based on complex filters (time range, source/destination, application, threat ID).
- Use Case: An AI agent can be triggered by a new threat intel report to query CDL for historical matches of IOCs or suspicious patterns over the last 90+ days, reconstructing a potential breach timeline.
This layer is ideal for training custom detection models on your organization's unique traffic patterns and threat history.
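As a rough sketch of the IOC use case above, the function below matches already-retrieved log records against a set of indicators and orders the hits into a timeline. The record layout (`time_generated`, `dest_ip`, `url_domain`) is an assumption for illustration and will differ from real CDL field names.

```python
from typing import Dict, Iterable, List, Set


def match_iocs(
    records: Iterable[Dict[str, str]],
    ioc_ips: Set[str],
    ioc_domains: Set[str],
) -> List[Dict[str, str]]:
    """Return records that touch a known IOC, oldest first."""
    hits = [
        rec
        for rec in records
        if rec.get("dest_ip") in ioc_ips or rec.get("url_domain") in ioc_domains
    ]
    # ISO 8601 strings in one consistent format sort chronologically as text.
    return sorted(hits, key=lambda rec: rec["time_generated"])


# Hypothetical sample records using documentation-only addresses.
records = [
    {"time_generated": "2024-03-02T10:00:00Z", "dest_ip": "198.51.100.7", "url_domain": "example.org"},
    {"time_generated": "2024-01-15T08:30:00Z", "dest_ip": "192.0.2.44", "url_domain": "bad.example"},
    {"time_generated": "2024-02-01T12:00:00Z", "dest_ip": "198.51.100.9", "url_domain": "bad.example"},
    {"time_generated": "2024-02-10T09:00:00Z", "dest_ip": "192.0.2.1", "url_domain": "example.com"},
]
timeline = match_iocs(records, ioc_ips={"198.51.100.7"}, ioc_domains={"bad.example"})
```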
High-Value AI Use Cases for CDL
Cortex Data Lake (CDL) stores petabytes of structured log data from Palo Alto Networks firewalls, Prisma Access, and Cloud NGFW. These use cases show how to apply AI to this historical data for proactive security, moving beyond real-time alerting to uncover hidden patterns and build smarter defenses.
Historical Attack Chain Reconstruction
Use AI to query months of CDL traffic and threat logs to reconstruct full attack timelines post-incident. Models correlate seemingly isolated events (e.g., an initial beacon, lateral movement, data staging) across time and assets to reveal the complete kill chain, providing lessons for detection engineering.
Trend-Based Threat Hunting
Deploy AI to analyze CDL data for subtle trends in denied traffic, URL categories, or application usage that indicate emerging threats (e.g., a new phishing campaign or C2 domain pattern). This shifts hunting from keyword searches to identifying statistical anomalies in long-term data.
Custom Detection Model Training
Use CDL as a labeled dataset to train organization-specific ML models. For example, extract features from allowed vs. denied session logs over 12+ months to create a bespoke model for detecting suspicious internal traffic that evades standard policies, then deploy it back to the firewall or Cortex XDR.
Policy Optimization & Clean-up
Apply AI to analyze firewall policy hit counts and application/URL usage patterns from CDL. Identify redundant, shadow, or overly permissive rules. Generate actionable recommendations for rule consolidation and risk reduction, directly tied to historical traffic evidence.
Data Exfiltration Pattern Discovery
Mine CDL for patterns indicative of data theft, such as regular, small data transfers to unusual geographies or to newly seen domains. AI models baseline normal outbound traffic volumes and protocols, flagging subtle, low-and-slow exfiltration attempts that occurred over months.
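One simple way to approximate the "regular, small transfers" signal is to look for source/destination pairs whose sessions are both small and evenly spaced. The sketch below uses the coefficient of variation of the gaps between sessions. The thresholds and the record fields (`src`, `dst`, `ts` in epoch seconds, `bytes`) are assumptions, and a production model would need a proper baseline rather than fixed cutoffs.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def find_periodic_small_transfers(
    sessions: List[Dict],
    min_events: int = 5,
    max_bytes: int = 50_000,
    max_cv: float = 0.1,
) -> List[Tuple[str, str]]:
    """Flag (src, dst) pairs with many small, evenly spaced sessions."""
    by_pair = defaultdict(list)
    for session in sessions:
        by_pair[(session["src"], session["dst"])].append(session)

    flagged = []
    for pair, rows in by_pair.items():
        if len(rows) < min_events:
            continue
        rows.sort(key=lambda r: r["ts"])
        gaps = [b["ts"] - a["ts"] for a, b in zip(rows, rows[1:])]
        avg_gap = mean(gaps)
        if avg_gap <= 0:
            continue
        # Low coefficient of variation means the gaps are nearly constant.
        cv = pstdev(gaps) / avg_gap
        if cv <= max_cv and all(r["bytes"] <= max_bytes for r in rows):
            flagged.append(pair)
    return flagged


# Hypothetical sessions: one hourly beacon-like pair, one irregular pair.
regular = [{"src": "10.0.0.5", "dst": "203.0.113.9", "ts": i * 3600, "bytes": 10_000} for i in range(6)]
irregular = [
    {"src": "10.0.0.8", "dst": "198.51.100.2", "ts": t, "bytes": 10_000}
    for t in (0, 100, 5000, 5100, 20000, 50000)
]
flagged = find_periodic_small_transfers(regular + irregular)
```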
Compliance & Audit Evidence Generation
Automate the generation of evidence for regulatory audits (PCI DSS, HIPAA) by using AI to query CDL for relevant traffic flows, policy configurations, and threat blocks over the audit period. Produce summarized reports and chain-of-custody documentation from the raw logs.
Example AI-Driven Workflows
These workflows demonstrate how to augment long-term threat hunting and model development in Cortex Data Lake with generative AI and retrieval-augmented generation (RAG). Each pattern connects AI analysis to specific CDL data surfaces and operational outcomes.
Trigger: A threat hunter formulates a hypothesis in plain language (e.g., "Find instances where a user account accessed sensitive file shares shortly after authenticating from a new country").
Context/Data Pulled:
- The AI agent translates the hypothesis into a structured Cortex XQL query.
- The query targets CDL tables like xdr_data (for endpoint process/network events) and auth_data (for authentication logs), joining on common identifiers like actor_username and timestamp.
- The query filters for a defined time window (e.g., last 90 days) and includes logic for geographic anomaly detection.
Model or Agent Action:
- The AI system executes the generated XQL query via the Cortex Data Lake API.
- Raw results are passed to an LLM with a prompt to: summarize findings, highlight the most suspicious sequences, and estimate the potential impact (e.g., data exfiltration risk).
System Update or Next Step:
- The AI generates a summary report and a list of high-fidelity host_id and actor_username values.
- These entities are automatically added to a Cortex XDR watchlist or used to create a new XDR investigation for immediate analyst review.
Human Review Point: The final summary and list of suspect entities are presented to the threat hunter for validation before any watchlist updates or investigation creation occurs.
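The human review point can be enforced in code by keeping findings in a pending state until an analyst approves them. The sketch below is one possible shape. The `HuntFinding` type and helper names are hypothetical, and the actual watchlist or investigation call is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HuntFinding:
    actor_username: str
    host_id: str
    summary: str
    approved: bool = False  # set only by a human reviewer


def pending_review(findings: List[HuntFinding]) -> List[HuntFinding]:
    """Findings still waiting for the threat hunter's decision."""
    return [f for f in findings if not f.approved]


def entities_for_watchlist(findings: List[HuntFinding]) -> List[Tuple[str, str]]:
    """Only approved findings contribute entities to downstream updates."""
    return sorted({(f.actor_username, f.host_id) for f in findings if f.approved})


findings = [
    HuntFinding("alice", "host-01", "File share access after login from new country"),
    HuntFinding("bob", "host-07", "Similar sequence, lower confidence"),
]
findings[0].approved = True  # analyst validates the first finding only
to_push = entities_for_watchlist(findings)
```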
Typical Implementation Architecture
A production-ready architecture for connecting AI models to Palo Alto Cortex Data Lake focuses on creating a secure, governed data pipeline that feeds long-term log data to custom models without impacting real-time security operations.
The core pattern involves a dedicated query and extraction pipeline that pulls historical data from the Cortex Data Lake API based on a scheduled hunt or model training need. This is typically implemented as a containerized service (e.g., in Kubernetes) that authenticates via OAuth 2.0, executes XQL queries for specific log types (like traffic, threat, url, or panorama), and streams the results to a staging area in cloud object storage (e.g., AWS S3, Azure Blob). This decouples the extraction from the Data Lake's primary function of serving real-time queries to Cortex XDR and XSIAM, preventing performance contention. The extracted logs are then transformed—normalizing timestamps, anonymizing sensitive fields like internal IPs if needed for training, and converting to a model-friendly format like Parquet.
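The transform step can be sketched with the standard library alone. The example below normalizes epoch timestamps to UTC ISO 8601 and replaces internal IPs with a keyed pseudonym, so the same host maps to the same token across the dataset. Field names and the key handling are assumptions, and writing Parquet is omitted because it needs a third-party library.

```python
import hashlib
import hmac
import ipaddress
from datetime import datetime, timezone
from typing import Dict


def pseudonymize_ip(ip: str, key: bytes) -> str:
    """Replace private addresses with a stable keyed token; keep public ones."""
    if not ipaddress.ip_address(ip).is_private:
        return ip
    digest = hmac.new(key, ip.encode(), hashlib.sha256).hexdigest()
    return f"internal-{digest[:12]}"


def normalize_record(rec: Dict, key: bytes) -> Dict:
    """Return a copy with a UTC ISO timestamp and pseudonymized addresses."""
    ts = datetime.fromtimestamp(rec["time_generated"], tz=timezone.utc)
    return {
        **rec,
        "time_generated": ts.isoformat(),
        "source_address": pseudonymize_ip(rec["source_address"], key),
        "dest_address": pseudonymize_ip(rec["dest_address"], key),
    }


key = b"demo-key-not-for-production"  # in practice, load from a secret store
rec = {"time_generated": 0, "source_address": "10.1.2.3", "dest_address": "8.8.8.8", "bytes": 512}
out = normalize_record(rec, key)
```

A keyed HMAC is used rather than a plain hash so that someone without the key cannot rebuild the mapping by hashing the small private address space.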
From the staging bucket, data flows into two primary AI workloads: 1) Custom Detection Model Training and 2) Proactive Threat Hunting. For training, a feature engineering job (using Spark or a similar framework) runs on the historical dataset to create features like source-destination pair frequency, application usage trends, or bytes transferred anomalies. These features train a model (scikit-learn, PyTorch) hosted in a separate inference service. For hunting, a Retrieval-Augmented Generation (RAG) pipeline indexes the log data into a vector database (e.g., Pinecone, Weaviate) using embeddings of key fields (url, rule_name, source_address). Security analysts can then query this index in natural language ("show me traffic to newly registered domains in the last 90 days") through a secure copilot interface, which retrieves relevant log snippets and generates a summary.
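Two of the features mentioned above, source-destination pair frequency and a bytes-transferred anomaly score, can be computed in a few lines. This is a small in-memory sketch, not a Spark job. The field names are assumptions, and a z-score is just one simple choice of anomaly measure.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def pair_frequency(records: List[Dict]) -> Counter:
    """Count sessions per (source, destination) pair."""
    return Counter((r["source_address"], r["dest_address"]) for r in records)


def bytes_zscores(records: List[Dict]) -> List[float]:
    """Z-score of each record's byte count against the whole batch."""
    values = [r["bytes"] for r in records]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical batch: four routine sessions and one large transfer.
records = [
    {"source_address": "a", "dest_address": "x", "bytes": 100},
    {"source_address": "a", "dest_address": "x", "bytes": 100},
    {"source_address": "a", "dest_address": "y", "bytes": 100},
    {"source_address": "b", "dest_address": "x", "bytes": 100},
    {"source_address": "b", "dest_address": "z", "bytes": 10_000},
]
freq = pair_frequency(records)
scores = bytes_zscores(records)
```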
Governance and rollout are critical. This architecture requires strict RBAC and audit trails on the extraction service to track who queried what data and when. Model outputs—whether new detection rules or hunt findings—should feed back into the operational security stack via approved channels. For example, a high-confidence anomaly detected by a custom model could be packaged as a new Cortex XDR XQL query and pushed to a staging folder for SOC lead review before promotion to active detection. The entire pipeline should be deployed incrementally, starting with a single log type (e.g., threat logs) and a non-critical use case to validate data quality, cost, and value before scaling to petabyte-scale historical analysis.
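For the audit trail, one lightweight option is to emit a structured record for every query the extraction service runs. The sketch below builds a JSON line capturing who ran what and when. The field set is an assumption, and where the line is shipped (file, SIEM, object store) is left open.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def audit_entry(
    user: str,
    role: str,
    query: str,
    row_count: int,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one audit record as a JSON line."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "user": user,
            "role": role,
            "query": query,
            # A digest makes identical queries easy to group in later review.
            "query_sha256": hashlib.sha256(query.encode()).hexdigest(),
            "row_count": row_count,
        },
        sort_keys=True,
    )


line = audit_entry(
    user="hunter01",
    role="threat-hunter",
    query="threat logs, last 90 days",
    row_count=1250,
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```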
Code and Payload Examples
Automating Query Creation for Threat Hunting
Use AI to translate natural language hunt hypotheses into valid Cortex XDR Query Language (XQL). This accelerates investigation by allowing analysts to describe what they're looking for in plain English, rather than manually constructing complex queries.
Example Workflow:
- Analyst provides a prompt: "Find processes on finance department endpoints that made outbound connections to new IPs in the last 24 hours."
- AI generates the corresponding XQL, handling table joins, time windows, and filter logic.
- The query is executed against Cortex Data Lake via API, returning results for analyst review.
This pattern reduces the barrier to proactive hunting and ensures consistent, optimized query syntax.
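A minimal sketch of this workflow is shown below: one function assembles the prompt for the model, and another applies basic guardrails to whatever query comes back before it is executed. The prompt wording, the dataset allowlist, and the sample query string are illustrative assumptions. The checks are simple heuristics, not an XQL parser, and the model call itself is omitted.

```python
import re
from typing import Iterable, List

ALLOWED_DATASETS = {"xdr_data"}  # hypothetical allowlist for this deployment


def build_prompt(hypothesis: str, datasets: Iterable[str]) -> str:
    """Assemble the instruction sent to the LLM."""
    return (
        "Translate the hunt hypothesis into a single XQL query.\n"
        f"Use only these datasets: {', '.join(sorted(datasets))}.\n"
        "Always end with a limit stage.\n"
        f"Hypothesis: {hypothesis}"
    )


def check_generated_query(query: str, max_limit: int = 10_000) -> List[str]:
    """Return a list of problems; an empty list means the query may run."""
    problems = []
    dataset = re.match(r"\s*dataset\s*=\s*(\w+)", query)
    if not dataset:
        problems.append("query must start with a dataset selection")
    elif dataset.group(1) not in ALLOWED_DATASETS:
        problems.append(f"dataset not allowlisted: {dataset.group(1)}")
    limit = re.search(r"\|\s*limit\s+(\d+)", query)
    if not limit:
        problems.append("query must include a limit stage")
    elif int(limit.group(1)) > max_limit:
        problems.append("limit exceeds the configured maximum")
    return problems


prompt = build_prompt(
    "Find processes on finance department endpoints that made outbound "
    "connections to new IPs in the last 24 hours.",
    ALLOWED_DATASETS,
)
# Stand-in for model output; field names are illustrative, not validated.
generated = "dataset = xdr_data | filter event_type = ENUM.NETWORK | fields agent_hostname, action_remote_ip | limit 500"
problems = check_generated_query(generated)
```

Rejecting queries that fail these checks, rather than trying to repair them, keeps the analyst in control of what actually runs.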
Realistic Time Savings and Operational Impact
How AI integration with Cortex Data Lake shifts long-term security analytics from manual, periodic reviews to continuous, automated intelligence generation.
| Workflow | Before AI | After AI | Key Notes |
|---|---|---|---|
| Historical Attack Pattern Discovery | Manual query building and iterative analysis over weeks | AI-generated hypotheses and pattern detection in days | Analyst reviews AI-suggested correlations, validates findings |
| Trend Analysis for Executive Reporting | Manual data aggregation and chart creation monthly/quarterly | Automated report generation with narrative summaries weekly | Human oversight for business context and strategic messaging |
| Custom Detection Model Training Data Curation | Manual log sampling and labeling, taking 2-3 weeks per model | AI-assisted log clustering and automated labeling, reducing to 3-5 days | Security data scientist reviews and refines AI-suggested labels |
| Threat Hunting for Dormant IOCs | Ad-hoc searches based on recent intel; coverage gaps likely | Continuous, scheduled hunting across 12+ months of logs | AI surfaces potential matches; analyst confirms and initiates response |
| Compliance Evidence Gathering | Manual search and extraction for audit periods | AI-driven query to pull relevant logs and session data | Auditors receive pre-filtered datasets; legal/GRC reviews output |
| Baseline Establishment for User/Entity Behavior | Statistical analysis on sampled data, updated quarterly | Dynamic behavioral modeling on full dataset, updated continuously | AI identifies drift; security team reviews anomalies for policy tuning |
| Proactive Threat Research & Hypothesis Testing | Time-intensive, often deprioritized for urgent incidents | Dedicated AI "research assistant" runs parallel hypothesis tests | Frees senior threat hunters to focus on high-value investigation |
Governance, Security, and Phased Rollout
Integrating AI with Palo Alto Cortex Data Lake requires a deliberate approach to data governance, model security, and controlled deployment to realize its full potential for threat hunting and model training.
The primary architectural consideration is ensuring AI workloads query the Cortex Data Lake API without impacting the real-time alerting and processing of Cortex XDR or XSIAM. This is achieved by implementing a dedicated middleware layer that handles authentication, rate limiting, and query optimization. This layer executes bulk queries—such as retrieving months of DNS logs or firewall traffic for trend analysis—during off-peak hours, caches results in a separate analytics store (like a vector database), and feeds curated datasets to AI models. This separation of concerns keeps production security operations performant while enabling deep, historical analysis.
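Two of the middleware behaviors described above, off-peak scheduling and result caching, are sketched below. The off-peak hours, the in-memory cache, and the `run` callable are assumptions for illustration. A real deployment would likely persist results in the separate analytics store and add authentication and rate limiting around the call.

```python
import hashlib
from datetime import datetime, time
from typing import Callable, Dict, List


def is_off_peak(now: datetime, start: time = time(22, 0), end: time = time(5, 0)) -> bool:
    """True when `now` falls in the off-peak window, which may wrap past midnight."""
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end


class QueryCache:
    """Run each distinct query once and reuse the stored result."""

    def __init__(self) -> None:
        self._store: Dict[str, List[dict]] = {}

    def get_or_run(self, query: str, run: Callable[[str], List[dict]]) -> List[dict]:
        key = hashlib.sha256(query.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = run(query)
        return self._store[key]


calls = []


def fake_run(query: str) -> List[dict]:
    """Stand-in for the real bulk query call."""
    calls.append(query)
    return [{"query": query}]


cache = QueryCache()
first = cache.get_or_run("dns logs, last 90 days", fake_run)
second = cache.get_or_run("dns logs, last 90 days", fake_run)
```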
Data governance is critical. Before feeding data to an LLM or custom ML model, you must implement strict data anonymization and filtering at the API call level. This involves redacting internal IP addresses, hostnames, and user identifiers from logs used for open-model inference or stripping all PII before using logs as training data for custom detection models. Access to the middleware and AI tools should be controlled via the same Role-Based Access Control (RBAC) principles used for the Cortex platform itself, with audit trails logging every query made to the Data Lake for compliance and forensic review.
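A small redaction pass applied before any log text reaches an LLM might look like the sketch below. It masks a configurable set of identifier fields and replaces private IPv4 addresses found inside string values. The field list and placeholder tokens are assumptions, and the regex covers IPv4 only.

```python
import ipaddress
import re
from typing import Dict

SENSITIVE_FIELDS = {"actor_username", "host_name", "source_user"}  # hypothetical list
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _mask_private(match: "re.Match[str]") -> str:
    """Replace a matched address only if it parses and is private."""
    text = match.group(0)
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return text  # not a valid address, leave untouched
    return "[INTERNAL_IP]" if addr.is_private else text


def redact_record(rec: Dict) -> Dict:
    """Return a redacted copy; the input record is not modified."""
    out = {}
    for field, value in rec.items():
        if field in SENSITIVE_FIELDS:
            out[field] = "[REDACTED]"
        elif isinstance(value, str):
            out[field] = IPV4_PATTERN.sub(_mask_private, value)
        else:
            out[field] = value
    return out


rec = {
    "actor_username": "jdoe",
    "message": "10.0.0.5 connected to 8.8.8.8",
    "bytes": 2048,
}
safe = redact_record(rec)
```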
A phased rollout mitigates risk and builds confidence. Start with a read-only, human-in-the-loop phase focused on a single use case, such as using an LLM to summarize quarterly threat hunting reports based on Data Lake queries. In Phase 2, move to assisted automation, where AI suggests new correlation rules or XQL queries for analysts to review and approve. The final phase involves closed-loop training, where approved, anonymized log data is used to fine-tune a specialized model (e.g., for detecting subtle data exfiltration patterns), with continuous validation against a holdback dataset to monitor for model drift. This measured approach ensures each step delivers tangible value while maintaining strict oversight over data usage and model outputs.
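The holdback validation in the final phase can be as simple as recomputing precision and recall on a fixed labeled set after each retraining and comparing against the accepted baseline. The sketch below shows that check. The tolerance value and the choice of recall as the drift signal are assumptions.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels (1 = malicious, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def has_drifted(baseline_recall: float, current_recall: float, tolerance: float = 0.05) -> bool:
    """Flag drift when recall falls more than `tolerance` below the baseline."""
    return baseline_recall - current_recall > tolerance


# Hypothetical holdback labels and the retrained model's predictions.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
needs_review = has_drifted(baseline_recall=0.9, current_recall=recall)
```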
Frequently Asked Questions
Common questions about using AI and large language models to unlock long-term threat intelligence, trend analysis, and custom model training from the vast data stored in Palo Alto Networks Cortex Data Lake.
AI integrations are designed to operate on a dedicated, read-only data pipeline separate from the primary SIEM ingestion and alerting streams.
Typical Architecture:
- Data Replication: Logs and enriched events are streamed to Cortex Data Lake. A secondary, filtered feed (e.g., specific log types, normalized fields) is replicated to a separate analytics environment (like a data warehouse or object store).
- AI Processing Layer: AI models and agents query this replicated dataset. This ensures compute-intensive operations like full-text semantic search, time-series forecasting, or bulk data labeling do not compete for resources with Cortex XDR or XSIAM's real-time detection engines.
- Results Integration: Insights generated by AI (e.g., newly identified threat patterns, labeled training data) are written back to Cortex Data Lake via its API as custom context or to a separate index, where they can be referenced by detection rules and investigations.
This separation of concerns maintains the performance and reliability of your primary security operations while enabling deep, historical analysis.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.