AI Integration for MuleSoft Runtime Fabric
Deploy, scale, and govern AI workloads within MuleSoft's managed Kubernetes environment for production-grade inference.

Where AI Fits in MuleSoft Runtime Fabric
MuleSoft Runtime Fabric provides the operational substrate for running integration applications and APIs. For AI, it becomes the managed environment for hosting inference endpoints, RAG pipelines, and orchestration agents as containerized services. Instead of treating AI as an external API call, Runtime Fabric allows you to deploy custom models (such as fine-tuned LLMs, embedding models, or classifiers) as first-class citizens alongside your Mule flows. This shifts AI from a latency-sensitive external dependency to a scalable, internal service with the same governance, networking, and observability as your core integrations.
The integration surface is the Kubernetes layer itself. Key operational touchpoints include:
- GPU-Aware Scheduling: Configuring node pools and resource requests to schedule AI pods on GPU-equipped workers for cost-efficient batch or real-time inference.
- Service Mesh Integration: Using the built-in service mesh for secure, observable communication between Mule applications and your AI model services, enabling canary deployments and traffic shifting between model versions.
- Horizontal Pod Autoscaling (HPA): Defining custom metrics (e.g., inference queue depth, tokens-per-second throughput) to automatically scale AI inference pods based on demand from integrated systems like Salesforce or SAP.
- Secrets and Configuration Management: Injecting API keys, model weights, and prompt templates as Kubernetes secrets or config maps, managed through the same CI/CD pipelines as your Mule applications.
- Unified Logging and Monitoring: Streaming AI service logs (e.g., from TensorFlow Serving or vLLM) into the same MuleSoft-managed monitoring stack, creating a single pane for tracing a business transaction from a CRM trigger through data enrichment to an AI-generated response.
Rollout and governance follow platform-native patterns. You package your AI service—including its runtime, model files, and a lightweight REST or gRPC interface—as a Docker image. Using MuleSoft's deployment descriptors, you define resource limits, health checks, and ingress rules. The AI service is then exposed internally to your MuleSoft Anypoint Platform integrations via a Kubernetes Service, allowing Mule flows to call it using a standard HTTP Request connector. For governance, you apply the same RBAC, network policies, and compliance audits to AI pods as you would to any other workload on Runtime Fabric. This operational consistency is critical for enterprises that need to prove model lineage, data residency, and controlled access for AI-powered integrations in regulated industries.
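A minimal sketch of the "lightweight REST interface" such an image might contain, using only the Python standard library. The `/predict` and `/healthz` paths, port 8080, and the keyword-based `classify` stand-in are assumptions for illustration; a real service would load model weights and would probably sit behind a production-grade server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(text):
    """Stand-in for a real model call (hypothetical keyword rule)."""
    return "negative" if "refund" in text.lower() else "neutral"


class InferenceHandler(BaseHTTPRequestHandler):
    def _send(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Target for Kubernetes liveness/readiness probes.
        if self.path == "/healthz":
            self._send(200, {"status": "ok"})
        else:
            self._send(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/predict":
            self._send(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            text = json.loads(self.rfile.read(length))["text"]
        except (ValueError, KeyError, TypeError):
            self._send(400, {"error": "expected a JSON object with a 'text' field"})
            return
        self._send(200, {"label": classify(text)})


def serve(port=8080):
    """Blocking entry point; this is what the container's command would run."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

A Mule flow would then reach this service through its Kubernetes Service name with an HTTP Request to `/predict`, while the health check in the deployment descriptor points at `/healthz`.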
AI Deployment Surfaces in Runtime Fabric
Scheduling AI Inference Workloads
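Runtime Fabric's managed Kubernetes environment allows you to define GPU resource requests and limits for AI inference pods. This is critical for self-hosting open-weight models like Llama 2 or Mistral 7B, or custom fine-tuned models, on NVIDIA A100, V100, or T4 GPUs. Hosted models such as GPT-4 cannot be deployed this way; they are reached through a wrapper service that calls the provider's API.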
Key Configuration Points:
- Use `nvidia.com/gpu` resource requests in your pod spec to guarantee access to GPU nodes.
- Configure Horizontal Pod Autoscaling (HPA) based on custom metrics like inference request latency or queue depth, not just CPU.
- Leverage node selectors and tolerations to ensure AI workloads are scheduled on GPU-enabled worker nodes, isolating them from general integration runtimes.
This surface enables predictable, high-throughput inference for real-time API calls from your MuleSoft flows, avoiding cold starts and maintaining SLAs.
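The configuration points above can be sketched as a small Python helper that generates a pod manifest, which is convenient when manifests are produced in a pipeline. The function name `gpu_pod_spec`, the `pool: gpu-pool` node label, and the `nvidia.com/gpu` taint key are illustrative assumptions, not Runtime Fabric conventions. One Kubernetes rule the sketch relies on: for extended resources such as `nvidia.com/gpu`, a request, if set, must equal the limit.

```python
def gpu_pod_spec(name, image, gpus=1, node_label=("pool", "gpu-pool")):
    """Build a pod manifest (as a dict) that targets GPU worker nodes.

    The label and taint names are hypothetical; use the ones defined
    on your own cluster's GPU node pool.
    """
    if gpus < 1:
        raise ValueError("an inference pod needs at least one GPU")
    gpu_resource = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                # For extended resources, requests must equal limits.
                "resources": {
                    "requests": dict(gpu_resource),
                    "limits": dict(gpu_resource),
                },
            }],
            # Pin the pod to the GPU pool ...
            "nodeSelector": {node_label[0]: node_label[1]},
            # ... and tolerate the taint that keeps other pods off it.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }
```

Serializing the result with `json.dumps` gives a document that `kubectl apply -f` accepts, since Kubernetes manifests can be written in JSON as well as YAML.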
High-Value AI Use Cases for Runtime Fabric
Runtime Fabric provides the managed Kubernetes substrate for MuleSoft's integration layer. These use cases focus on injecting AI directly into the operational plane—optimizing workload placement, scaling, and resilience for AI-enhanced integrations.
Intelligent GPU-Aware Scheduling
Deploy and scale GPU-intensive AI inference workloads (e.g., vision models, large language models) alongside traditional integration runtimes. Runtime Fabric's scheduler can be extended with custom policies to bin-pack GPU workloads efficiently, prioritize latency-sensitive inference pods, and manage spot instance fallbacks for cost optimization.
AI Workload Autoscaling with Prometheus Metrics
Move beyond CPU/memory scaling for AI containers. Implement custom Horizontal Pod Autoscalers (HPAs) that react to AI-specific metrics like inference queue length, model latency percentiles (p95/p99), or tokens-per-second throughput. This ensures AI-enhanced APIs served from Runtime Fabric maintain SLAs under variable load.
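To make the scaling behaviour concrete, here is a minimal sketch of the replica calculation the Kubernetes HPA documents for a single averaged metric: `desired = ceil(currentReplicas * currentMetric / targetMetric)`, with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function name and the example numbers are assumptions for illustration; the real controller also handles missing metrics, stabilization windows, and multiple metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count the HPA would aim for, given one averaged metric.

    Follows the documented HPA rule:
        desired = ceil(current_replicas * current_metric / target_metric)
    and skips scaling when the ratio is within the tolerance band.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))
```

For example, with 4 replicas, an average queue depth of 30 requests per pod, and a target of 10, the uncapped result is 12 replicas; with the default `max_replicas=10` the function returns 10.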
Canary Releases for AI Model Endpoints
Safely roll out new versions of AI models (e.g., a newly fine-tuned model, or a wrapper service moving from GPT-4 to GPT-4o) by deploying them as separate services within Runtime Fabric. Use Istio-based traffic splitting managed by Runtime Fabric to route a percentage of API traffic to the new model, monitor for performance regressions or quality drift, and automate rollback based on business metrics.
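A service mesh performs the split at the network layer, but the idea is easy to see at application level. The sketch below is an illustration only, not how Istio implements it: it hashes a request id into one of 100 buckets and sends the lowest `canary_percent` buckets to the new version. The version labels `v1` and `v2` and the function name are assumptions.

```python
import hashlib


def pick_model_version(request_id, canary_percent, stable="v1", canary="v2"):
    """Deterministically route a request to the stable or canary model.

    Hashing the request id (rather than calling random()) keeps retries
    of the same request on the same version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return canary if bucket < canary_percent else stable
```

Raising `canary_percent` step by step (for instance 5, 25, 50, 100) while watching error rates and output quality gives the incremental rollout described above; setting it back to 0 is the rollback.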
Unified Logging & Tracing for AI-Integration Pipelines
Correlate logs and traces across MuleSoft flows and co-located AI microservices. Instrument AI containers to emit OpenTelemetry traces, feeding into Runtime Fabric's observability stack. This provides end-to-end visibility when an API call triggers a DataWeave transformation, a call to an LLM, and a response enrichment—critical for debugging complex, AI-augmented workflows.
Cost-Optimized Hybrid Inference Routing
Deploy smaller, faster models (e.g., for classification) locally on Runtime Fabric, while routing complex generative requests to external cloud AI services (OpenAI, Azure AI). Implement a routing layer within Runtime Fabric that makes dynamic decisions based on request content, latency requirements, and current cost profiles, all managed within the same security and networking perimeter.
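A minimal sketch of such a routing decision follows. All thresholds, task names, and the `InferenceRequest` fields are hypothetical placeholders; a real router would read them from configuration and from live cost and latency data.

```python
from dataclasses import dataclass

LOCAL_TASKS = {"classify", "embed"}   # assumed to be served in-cluster
LOCAL_TOKEN_LIMIT = 2048              # assumed context budget of the local model
EXTERNAL_MIN_LATENCY_MS = 500         # assumed round trip to the external provider


@dataclass
class InferenceRequest:
    task: str             # e.g. "classify" or "generate"
    prompt_tokens: int
    max_latency_ms: int   # latency budget stated by the caller


def choose_backend(req, external_budget_available=True):
    """Return "local", "external", or "reject" for one request."""
    fits_locally = req.prompt_tokens <= LOCAL_TOKEN_LIMIT
    prefers_local = (
        req.task in LOCAL_TASKS
        or not external_budget_available            # spend cap reached
        or req.max_latency_ms < EXTERNAL_MIN_LATENCY_MS
    )
    if fits_locally and prefers_local:
        return "local"
    if not external_budget_available:
        return "reject"  # too large for local, no budget for external
    return "external"
```

Keeping the decision in one pure function makes it straightforward to unit-test and to log alongside each request for later cost analysis.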
AI-Enhanced Health Checks & Self-Healing
Extend Kubernetes liveness/readiness probes for AI containers. Implement probes that perform lightweight inference to validate model responsiveness and output sanity, not just TCP connectivity. Pair this with Runtime Fabric's restart policies to automatically recycle pods where model performance degrades, ensuring high availability for AI-powered integration endpoints.
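One way to sketch such a probe is a function that runs a single cheap inference and checks both latency and output sanity. The names, the one-second budget, and the label set are assumptions for illustration; the model is passed in as a callable so the check can be tested without loading real weights.

```python
import time


def inference_probe(predict, sample_input, allowed_labels, max_latency_s=1.0):
    """Return (healthy, reason) after one lightweight inference call.

    `predict` is any callable wrapping the model; it is injected so the
    probe can be unit-tested without a real model.
    """
    start = time.monotonic()
    try:
        output = predict(sample_input)
    except Exception as exc:  # the model call itself failed
        return False, f"inference raised {type(exc).__name__}"
    elapsed = time.monotonic() - start
    if elapsed > max_latency_s:
        return False, "inference too slow"
    if output not in allowed_labels:
        return False, "output failed sanity check"
    return True, "ok"
```

The service would expose this behind an HTTP path that the readiness or liveness probe calls, returning 200 when `healthy` is true and 503 otherwise, so Kubernetes stops routing to or restarts a pod whose model has degraded.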
Example AI-Enhanced Integration Workflows
These workflows illustrate how to embed AI inference directly into MuleSoft Runtime Fabric, turning your integration layer into an intelligent orchestration engine. Each pattern focuses on operational efficiency, resource optimization, and seamless integration with your existing MuleSoft assets.
Trigger: A scheduled MuleSoft flow initiates a nightly batch job for customer sentiment analysis on support ticket data.
Context/Data Pulled: The flow queries Salesforce Service Cloud for the day's closed tickets, extracting the case description and comments.
Model/Agent Action:
- The flow packages the data and submits a job request to a Kubernetes Job Controller deployed alongside your MuleSoft applications on Runtime Fabric.
- The controller, aware of GPU node labels and current load, schedules the job on a node with available NVIDIA GPU resources.
- A containerized inference service (e.g., a fine-tuned sentiment model) processes the batch, returning scores and key phrases.
System Update: The MuleSoft flow receives the results, enriches the original Salesforce Case records with the sentiment score, and logs the job metrics (duration, GPU utilization) to Datadog or Dynatrace.
Human Review Point: Tickets flagged with 'Critical Negative' sentiment are automatically routed to a dedicated queue in Service Cloud for manager review the next morning.
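The final two steps of this workflow can be sketched as a small triage function. The field names `case_id` and `sentiment`, the score range of -1.0 to 1.0, and the -0.8 threshold are assumptions for this sketch; the source does not specify how "Critical Negative" is defined.

```python
def triage(tickets, critical_threshold=-0.8):
    """Split scored tickets into (manager_review, no_action) lists.

    Each ticket is a dict with a `case_id` and a `sentiment` score in
    [-1.0, 1.0]; both field names are assumptions for this sketch.
    """
    review, rest = [], []
    for ticket in tickets:
        critical = ticket["sentiment"] <= critical_threshold
        label = "Critical Negative" if critical else "Other"
        labelled = {**ticket, "label": label}  # do not mutate the input
        (review if critical else rest).append(labelled)
    return review, rest
```

In the Mule flow, the `review` list would drive the update that moves those cases into the manager queue, while both lists carry the score written back to the Case records.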
Implementation Architecture: Wiring AI into Runtime Fabric
Deploying and scaling AI inference workloads within MuleSoft's managed Kubernetes environment requires a deliberate architecture for performance, cost, and resilience.
MuleSoft Runtime Fabric provides a managed Kubernetes layer for deploying integration applications and APIs. To wire AI into this environment, you treat AI models as containerized services—similar to a custom connector or microservice—but with distinct resource requirements. The core architectural pattern involves:
- Deploying AI inference containers (e.g., TensorFlow Serving, Triton Inference Server, or custom FastAPI apps wrapping cloud LLM SDKs) as separate pods within the same Runtime Fabric cluster.
- Exposing models as internal services via Kubernetes `Service` objects, allowing your MuleSoft flows to call them over HTTP/gRPC from within the cluster network, avoiding public internet latency and egress costs.
- Managing GPU resources via Kubernetes node pools and resource requests/limits (`nvidia.com/gpu`) to ensure predictable performance for compute-intensive models like vision or large language models.
- Using MuleSoft's application networking (Anypoint Runtime Manager) to apply policies (such as rate limiting, client ID enforcement, or mutual TLS) even to internal AI service calls, maintaining a consistent governance layer.
In practice, a MuleSoft flow acts as the orchestration controller. It receives an API request, performs necessary data validation and transformation using DataWeave, and then calls the appropriate AI service endpoint. For example:
- A flow handling customer support tickets could call a sentiment analysis model to prioritize escalations.
- An order processing flow might invoke a fraud detection model before committing the transaction.
- A product API could use a vector search service (deployed as a separate pod with a vector database like Qdrant) to power semantic product recommendations.
The key is to keep the AI inference stateless and idempotent, with all session or context management handled by the Mule application. This allows the AI pods to scale horizontally based on HPA (Horizontal Pod Autoscaler) metrics like CPU/GPU utilization or request queue depth. Runtime Fabric's built-in observability (logs, metrics, traces) can then be extended to monitor AI service health and latency, creating a unified view of the intelligent integration pipeline.
Rollout and governance require careful staging. Start by deploying AI models to a dedicated namespace or worker node pool in Runtime Fabric to isolate their resource impact. Use Kubernetes ResourceQuotas to prevent AI workloads from starving core integration applications. For model updates, implement a blue-green or canary deployment strategy using Kubernetes Deployment objects, routing a percentage of traffic from your Mule flows to the new model version and monitoring for errors or performance drift. Finally, integrate this architecture with your MLOps lifecycle: use CI/CD pipelines to build and push model container images to a registry (like ECR or GCR), and leverage tools like KServe or Seldon Core—deployed alongside Runtime Fabric—for advanced model serving features like explainability (XAI) and automated canary analysis. This approach turns Runtime Fabric from a pure integration runtime into a governed, scalable platform for operational AI.
Code and Configuration Examples
Configuring AI Model Pods for GPU Scheduling
Runtime Fabric's managed Kubernetes environment allows you to request and schedule GPU resources for inference workloads. The key is defining the correct resource requests and limits in your application's deployment spec so that the Kubernetes scheduler places pods on nodes with available accelerators.
Use nvidia.com/gpu as the resource type. Specify the exact count needed for your model's batch size and latency requirements. This example shows a pod spec for a container running a PyTorch model, requesting a single NVIDIA GPU and ensuring the node selector matches a GPU-equipped worker pool.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference-pod
  namespace: mulesoft-apps
spec:
  containers:
    - name: text-generation
      image: your-registry/llm-service:latest
      resources:
        requests:
          nvidia.com/gpu: 1   # GPU request must equal the GPU limit
          memory: "16Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: 1
          memory: "24Gi"
          cpu: "8"
      env:
        - name: MODEL_PATH
          value: "/models/mistral-7b"
  # Schedule only on nodes from the T4 GPU worker pool (GKE node label)
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
```
Operational Impact and Time Savings
This table compares the operational impact of deploying and managing AI inference workloads on MuleSoft Runtime Fabric before and after implementing AI-native orchestration and resource optimization.
| Operational Area | Before AI Integration | After AI Integration | Key Notes |
|---|---|---|---|
| GPU Resource Scheduling | Manual node selection and static allocation | Dynamic, policy-driven scheduling based on workload | Optimizes expensive GPU utilization; reduces idle time |
| Model Deployment & Rollout | Manual YAML/Helm chart updates per environment | GitOps-driven deployment with automated canary analysis | Reduces deployment errors; enables safe, incremental rollouts |
| Scaling Decision Latency | Reactive scaling based on CPU/memory metrics (minutes) | Predictive scaling using inference queue depth and latency SLOs (seconds) | Maintains consistent response times under variable AI request loads |
| Multi-Model Cost Optimization | Fixed resource reservations per model, leading to over-provisioning | Intelligent bin-packing and shared GPU inference pools | Lowers cloud infrastructure spend by increasing density |
| Health Monitoring & Remediation | Alert fatigue from generic K8s metrics; manual triage | AI-specific health checks (e.g., model staleness, output drift) with auto-remediation runbooks | Improves mean time to recovery (MTTR) for AI service degradation |
| Security & Compliance Patching | Manual, periodic vulnerability scans and node updates | Continuous, policy-enforced scanning with automated, zero-downtime node rotations | Reduces security exposure window for AI workload dependencies |
| Capacity Planning | Quarterly forecasting based on historical growth trends | Real-time forecasting using API traffic patterns and planned model releases | Enables just-in-time procurement and avoids provisioning delays |
Governance, Security, and Phased Rollout
Deploying AI workloads on Runtime Fabric requires a deliberate approach to resource governance, security posture, and controlled rollout to ensure stability and ROI.
Runtime Fabric's managed Kubernetes environment provides the control plane for governing AI inference. Key operational surfaces include:
- GPU Scheduling and Quotas: Defining resource requests/limits (`nvidia.com/gpu`) per AI service pod to prevent resource starvation and control cloud spend.
- Horizontal Pod Autoscaling (HPA): Configuring HPA based on custom metrics (e.g., inference queue depth, request latency) to scale AI model replicas dynamically with traffic.
- Network Policies: Isolating AI model endpoints within the cluster, restricting ingress to only authorized MuleSoft applications or external API gateways like Kong.
- Runtime Security: Integrating with image scanning and admission controllers to ensure only approved, vulnerability-free container images (e.g., PyTorch, TensorFlow Serving) are deployed.
A production integration follows a phased rollout to mitigate risk and validate value:
- Shadow Mode: Deploy the AI service (e.g., a document classifier) and have it process real payloads from a MuleSoft flow, but route decisions based on the existing logic. Log AI outputs for accuracy evaluation without impacting business processes.
- Canary Release: Route a small percentage (e.g., 5%) of specific, non-critical traffic through the AI-driven decision path within a MuleSoft application. Use Runtime Fabric's deployment strategies and Istio-based traffic splitting for controlled exposure.
- Controlled Expansion: Gradually increase traffic volume and expand to new use cases (e.g., from invoice data extraction to contract clause analysis) based on performance SLAs and business validation. Implement circuit breakers in MuleSoft flows to failover to legacy processes if AI service latency or error rates breach thresholds.
- Automated Operations: Embed AI service health checks into existing APM and alerting (via Runtime Fabric metrics integration) and establish automated rollback procedures for model regression.
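The circuit-breaker behaviour mentioned under Controlled Expansion can be sketched in a few lines. This Python version only illustrates the logic; in a Mule application the same pattern would be expressed with the flow's own error handling. The class name, thresholds, and the `ai_call`/`legacy_call` callables are assumptions for the sketch.

```python
import time


class CircuitBreaker:
    """Fall back to a legacy path after repeated AI service failures."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable for tests
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, ai_call, legacy_call, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return legacy_call(*args)   # open: skip the AI service
            self._opened_at = None          # half-open: allow one trial call
        try:
            result = ai_call(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # (re)open the circuit
            return legacy_call(*args)
        self._failures = 0                  # success closes the circuit
        return result
```

While the circuit is open, callers get the legacy result immediately instead of waiting on a degraded model; after `reset_after_s` a single trial call decides whether the AI path is restored or the circuit reopens.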
Security is enforced at multiple layers. All calls from MuleSoft flows to AI endpoints use mutual TLS (mTLS) for service-to-service authentication within the cluster. For external AI services (e.g., Azure OpenAI, AWS Bedrock), credentials are managed via MuleSoft's secure properties and never hard-coded. Audit trails are maintained by logging all inference requests (with PII redaction) and decisions to a centralized SIEM, enabling traceability from business event in Anypoint Platform to AI-generated output. This layered governance ensures AI workloads on Runtime Fabric are scalable, secure, and surgically integrated into mission-critical integration pipelines.
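As an illustration of the "PII redaction" step in that audit trail, the sketch below masks email addresses and phone-like numbers before a record is logged. The two regular expressions are deliberately simple assumptions and will miss many kinds of PII (names, addresses, account numbers); a production system would use a dedicated detection service.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def audit_record(request_id, prompt, decision):
    """Build the log entry sent to the SIEM; the raw prompt is never stored."""
    return {
        "request_id": request_id,
        "prompt": redact(prompt),
        "decision": decision,
    }
```

Redacting before the record leaves the service, rather than in the SIEM, keeps raw PII out of every intermediate log pipeline.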
FAQ: AI on MuleSoft Runtime Fabric
Practical questions for teams deploying and scaling AI inference workloads on MuleSoft's managed Kubernetes environment.
Runtime Fabric (RTF) manages standard Kubernetes clusters, but GPU scheduling requires specific node pool configuration and resource declarations.
Typical Implementation Pattern:
- Provision GPU Nodes: Work with your cloud provider (AWS, Azure, GCP) to attach GPU-enabled node pools to your RTF cluster. This is often done via the MuleSoft Runtime Manager console or infrastructure-as-code.
- Define Resource Requests: In your Mule application's `mule-artifact.json` or deployment descriptor, specify GPU resource requirements:
  ```json
  "deploymentSettings": {
    "resources": {
      "limits": { "nvidia.com/gpu": "1" }
    }
  }
  ```
- Use Taints and Tolerations: Apply taints to GPU nodes (e.g., `gpu=true:NoSchedule`) and matching tolerations in your AI service deployment to ensure workloads land on the correct nodes.
- Consider Model Servers: For production, deploy dedicated model-serving containers (e.g., Triton Inference Server, KServe) alongside your Mule apps, using RTF's service mesh for internal routing, rather than embedding heavy models directly in Mule runtimes.
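To show how the taint and toleration step keeps other pods off GPU nodes, here is a simplified Python model of the Kubernetes matching rules. It is a teaching sketch, not the scheduler's code: it ignores `tolerationSeconds` and treats `PreferNoSchedule` taints as non-blocking. The function names are assumptions.

```python
def tolerates(toleration, taint):
    """True if one toleration matches one taint (Kubernetes matching rules)."""
    # An empty effect on the toleration matches every effect.
    effect_ok = (not toleration.get("effect")
                 or toleration["effect"] == taint["effect"])
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        key_ok = not toleration.get("key") or toleration["key"] == taint["key"]
        return effect_ok and key_ok
    return (effect_ok
            and toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations, node_taints):
    """A pod can land on a node only if every blocking taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )
```

With the `gpu=true:NoSchedule` taint from the list above, a pod without tolerations is rejected by `schedulable`, while a pod carrying `{"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}` is accepted. Note that a toleration only permits placement; a node selector or affinity rule is still needed to make the AI pod prefer the GPU pool.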

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.