Glossary

Kubernetes Deployment

A Kubernetes Deployment is a declarative configuration object that manages the desired state for a set of identical Pods running a containerized application, such as a model inference service, and handles their updates and scaling.

MODEL SERVING ARCHITECTURES

What is a Kubernetes Deployment?

A core Kubernetes object for declaratively managing the lifecycle of containerized applications, such as model inference services.

A Kubernetes Deployment is a declarative configuration object that manages the desired state for a set of identical Pods running a containerized application, such as a model inference service. It automates critical operational tasks including rolling updates, rollbacks, and scaling, ensuring the specified number of replica Pods are always running and healthy. This abstraction is fundamental for deploying stateless, scalable microservices.

For machine learning, a Deployment YAML file defines the container image (e.g., a Triton Inference Server), resource requests/limits, and environment variables. The Deployment controller continuously reconciles the actual state with this declared state, so Pods lost to node failures are replaced and scheduled onto healthy nodes. Canary and blue-green rollouts for safe model updates can be built from multiple Deployments behind a Service or with higher-level tooling, and a Deployment integrates with the Horizontal Pod Autoscaler for demand-based scaling, forming the backbone of resilient model serving infrastructure.
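
As a minimal sketch of such a manifest, the Python snippet below builds a Deployment as a plain dictionary and prints it as JSON, which kubectl accepts alongside YAML. The name triton-inference, the image reference, the port, the MODEL_NAME variable, and the resource figures are illustrative assumptions, not values from a real cluster.

```python
import json


def build_deployment(name, image, replicas=3):
    """Return an apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the Pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": 8000}],
                        # Hypothetical environment variable for illustration.
                        "env": [{"name": "MODEL_NAME", "value": "example-model"}],
                        "resources": {
                            "requests": {"cpu": "2", "memory": "8Gi"},
                            "limits": {"cpu": "4", "memory": "16Gi"},
                        },
                    }],
                },
            },
        },
    }


# Hypothetical image reference; substitute your own registry and tag.
manifest = build_deployment("triton-inference", "example.registry/triton:1.0")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would create the Deployment, assuming the image exists and the cluster has capacity for the requested resources.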

Key Features of Kubernetes Deployments

A Kubernetes Deployment provides declarative updates and lifecycle management for a set of identical Pods running a containerized application, such as a model inference service. It is the primary controller for managing stateless applications in Kubernetes.

Declarative Desired State

A Deployment's core function is to declaratively manage a desired state for your application. You define the state in a YAML manifest (e.g., 3 replicas, container image my-model:v2), and the Deployment's controller continuously reconciles the actual cluster state to match. This automation is critical for maintaining the availability of inference endpoints.

Example: A manifest specifies replicas: 5 and image: inference-service:3.1.0. The controller works to keep five Pods running with that image, replacing any that fail or are deleted.
Mechanism: The controller acts on differences between the .spec (desired) and .status (actual) fields, creating or deleting Pods via a ReplicaSet.
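
The reconcile idea can be illustrated with a toy, in-memory loop. This is a simplified model, not the real controller: the actual Deployment controller works through ReplicaSets and the API server, and replaces Pods gradually rather than in a single pass. The reconcile function and the Pod dictionaries are hypothetical.

```python
import itertools

_ids = itertools.count(1)


def reconcile(spec, pods):
    """One pass of a toy reconcile loop over an in-memory Pod list."""
    # Keep only Pods that already run the desired image.
    current = [p for p in pods if p["image"] == spec["image"]]
    # Scale up: create Pods until the desired count is reached.
    while len(current) < spec["replicas"]:
        current.append({"name": f"pod-{next(_ids)}", "image": spec["image"]})
    # Scale down: drop any surplus Pods.
    return current[: spec["replicas"]]


spec = {"replicas": 5, "image": "inference-service:3.1.0"}
pods = [{"name": "old-1", "image": "inference-service:3.0.0"}]
pods = reconcile(spec, pods)
print(len(pods), {p["image"] for p in pods})
```

Running reconcile again on its own output changes nothing, which is the property that makes a declarative loop safe to repeat.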

Rolling Updates and Rollbacks

Deployments enable zero-downtime updates through a rolling update strategy. When you update the Pod template (e.g., a new model version), the controller incrementally replaces old Pods with new ones, ensuring a subset of replicas remains available to serve traffic. This is essential for deploying new model versions without interrupting service.

Strategy Controls: You can configure maxUnavailable and maxSurge to control the pace of the rollout.
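Rollback on Failure: If an update fails its health checks, the rollout stalls instead of completing, and the old Pods keep serving within the maxUnavailable budget. A Deployment does not roll back on its own; you revert with kubectl rollout undo, or rely on higher-level tooling to automate the rollback.
Version History: Kubernetes maintains a revision history, allowing manual rollback to any retained Deployment revision (the number kept is set by revisionHistoryLimit).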
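
To make the effect of maxSurge and maxUnavailable concrete, the sketch below computes the Pod-count bounds a rolling update must stay within. It follows the documented rounding rules (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down) and the 25% defaults; the rollout_bounds helper itself is hypothetical and does not validate inputs the way the API server does.

```python
def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max total Pods, min available Pods) during a rolling update."""

    def resolve(value, round_up):
        # Percentages are resolved against the desired replica count.
        if isinstance(value, str) and value.endswith("%"):
            scaled = replicas * int(value[:-1])
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    surge = resolve(max_surge, round_up=True)               # rounds up
    unavailable = resolve(max_unavailable, round_up=False)  # rounds down
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10))  # (13, 8)
```

With 10 replicas and the defaults, at most 13 Pods exist at once and at least 8 stay available. Setting max_surge=1 and max_unavailable=0 gives the most conservative rollout, at the cost of speed.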

Pod Replication and Scaling

Deployments ensure high availability and scale by managing multiple identical Pod replicas. The replica count is a core spec field. Scaling is managed manually via kubectl scale or automatically by a Horizontal Pod Autoscaler (HPA) based on CPU/memory usage or custom metrics like inference request queue length.

High Availability: Distributing replicas across cluster nodes protects against node failures.
Load Distribution: A Kubernetes Service load-balances incoming inference requests across all healthy Pod replicas.
Example: An HPA policy can scale a model deployment from 3 to 10 replicas during peak inference demand, then scale back down to conserve resources.
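
The scaling decision can be sketched with the formula the Horizontal Pod Autoscaler documentation gives: desired replicas equal the current count times the ratio of observed to target metric value, rounded up. The version below is a simplified illustration under stated assumptions: it applies a 10% tolerance and clamps to a 3 to 10 replica range, and it ignores stabilization windows, unready Pods, and multiple metrics. The function name and the metric values are hypothetical.

```python
import math


def desired_replicas(current, metric_value, target_value,
                     min_replicas=3, max_replicas=10, tolerance=0.1):
    """Simplified HPA-style replica calculation for a single metric."""
    ratio = metric_value / target_value
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # Clamp to the configured replica range.
    return max(min_replicas, min(max_replicas, wanted))


# Queue length of 90 per Pod against a target of 50, starting from 3 replicas.
print(desired_replicas(3, metric_value=90, target_value=50))  # 6
```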

Health Monitoring with Probes

Readiness, liveness, and startup probes, defined in the Deployment's Pod template and run by the kubelet on each node, manage container lifecycle and ensure traffic is only sent to healthy endpoints. This is critical for model serving, where initialization (loading large models) can take significant time.

Readiness Probe: Determines if a Pod is ready to receive traffic. A model server pod should only be marked ready after the model is fully loaded into GPU memory.
Liveness Probe: Determines if a container is still running correctly. If it fails, the kubelet restarts the container. This can recover a model server that has hung or deadlocked.
Startup Probe: Used for applications with long startup times, which includes model servers that load large weights. Liveness and readiness checks are held off until the startup probe succeeds, so slow initialization can be given its own, more generous budget.
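
As a rough sketch of how the three probes fit together, the snippet below defines a probe section for a container spec and computes how long the startup probe allows a model to load. The paths and port follow Triton's HTTP health endpoints but should be treated as assumptions for your server, and startup_budget_seconds is a hypothetical helper. The fallback values are the Kubernetes defaults for periodSeconds (10), failureThreshold (3), and initialDelaySeconds (0).

```python
def startup_budget_seconds(probe):
    """Approximate time a container has to pass its startup probe."""
    return (probe.get("initialDelaySeconds", 0)
            + probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10))


# Probe section of a container spec; paths and port are assumed values.
probes = {
    "startupProbe": {
        "httpGet": {"path": "/v2/health/ready", "port": 8000},
        "periodSeconds": 10,
        "failureThreshold": 30,
    },
    "readinessProbe": {
        "httpGet": {"path": "/v2/health/ready", "port": 8000},
        "periodSeconds": 5,
    },
    "livenessProbe": {
        "httpGet": {"path": "/v2/health/live", "port": 8000},
        "periodSeconds": 10,
    },
}

print(startup_budget_seconds(probes["startupProbe"]))  # 300
```

Here the model has roughly five minutes to load before the container is restarted; a larger model would need a higher failureThreshold.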

Resource Management and Limits

Deployments allow precise specification of resource requests and limits for CPU and memory, plus extended resources such as GPUs. This is fundamental for cost control and performance stability in inference clusters, preventing a single greedy model from starving others.

Requests: Resources reserved for the Pod (e.g., 4 CPUs, 16Gi memory, 1 nvidia.com/gpu). The scheduler uses these to place Pods on nodes with enough free capacity. Extended resources such as nvidia.com/gpu are set under limits, and a request, if given, must equal the limit.
Limits: The maximum resources a container can use. A container that exceeds its memory limit is terminated (OOMKilled), while one that exceeds its CPU limit is throttled.
Quality of Service (QoS): Based on these settings, Kubernetes assigns Pods a QoS class (Guaranteed, Burstable, BestEffort), which influences the order in which Pods are evicted when a node runs short of resources.
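
The QoS rules can be sketched as a small classifier over a Pod's containers. This is an approximation under stated assumptions: it only looks at CPU and memory, and it compares quantities as plain strings, so "1" and "1000m" would be treated as different even though Kubernetes considers them equal. The qos_class function is hypothetical.

```python
def qos_class(containers):
    """Classify a Pod from the CPU and memory settings of its containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        # When a request is omitted, Kubernetes defaults it to the limit.
        requests = {**limits, **resources.get("requests", {})}
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                any_set = True
            # Guaranteed needs a limit, and a request equal to it.
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pinned = {"resources": {"requests": {"cpu": "4", "memory": "16Gi"},
                        "limits": {"cpu": "4", "memory": "16Gi"}}}
print(qos_class([pinned]))  # Guaranteed
```

For latency-sensitive inference Pods, setting requests equal to limits is a common way to obtain the Guaranteed class and be evicted last.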

Integration with Ecosystem Tools

Deployments are not used in isolation; they are the foundational object integrated with higher-level Kubernetes ecosystem tools specifically designed for ML serving, providing advanced capabilities like canary deployments, inference graphs, and specialized hardware acceleration.

KServe/Knative Serving: Provides serverless, auto-scaling capabilities with request-level metrics and canary rollouts on top of Deployments.
Seldon Core: Uses Deployments to orchestrate complex inference graphs (pre-processing -> model A -> model B -> post-processing).
NVIDIA GPU Operator/Kubernetes Device Plugins: Manages GPU resource scheduling and drivers, allowing Deployments to request nvidia.com/gpu resources.
Service Meshes (Istio, Linkerd): Provide advanced traffic routing, mirroring, and security between Deployment Pods for sophisticated canary and A/B testing of models.

How a Kubernetes Deployment Works

A Kubernetes Deployment is a declarative API object that defines a desired state for a replicated application. You specify the container image, number of replicas (pods), and update strategy. The Deployment controller continuously observes the cluster, reconciling the actual state to match your declared intent by creating, scaling, or updating pods. This automation is foundational for model serving architectures, ensuring a specified number of inference server instances are always running.

For updates, the controller orchestrates a rolling update by incrementally replacing old Pods with new ones, maintaining service availability. Canary and blue-green rollouts build on the same primitives, typically by running two Deployments behind one Service or traffic router. Scaling can be automated by the Horizontal Pod Autoscaler based on metrics like CPU usage. This declarative, self-healing management is essential for MLOps, providing a robust, scalable foundation for deploying and maintaining model inference services in production.

KUBERNETES DEPLOYMENT PATTERNS

Deployment Update Strategies for Model Serving

A comparison of Kubernetes strategies for updating model inference services, balancing availability, rollback speed, and resource overhead.

Strategy	Rolling Update	Recreate	Blue-Green	Canary
Update Mechanism	Gradual pod replacement	Terminate all, then create new	Full parallel environment swap	Controlled traffic split to new version
Downtime	No	Yes	No	No
Rollback Speed	Medium (pods must cycle)	Fast (single update)	Instant (traffic switch)	Instant (traffic switch)
Resource Overhead	Low (temporary surge pods)	None	High (2x full environment)	Medium (partial 2x pods)
Traffic Control Granularity	Pod-level	N/A	Environment-level	Request-level (e.g., 5%)
Testing in Production
Kubernetes Native Support	Yes	Yes	No	No
Typical Use Case	General stateless updates	Breaking changes, schema migrations	Zero-downtime major releases	Performance/accuracy validation
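
One way to see why canary granularity depends on tooling: when a stable and a canary Deployment share a single Service and no traffic router is involved, the canary's share of requests is roughly its share of Pods. The sketch below, with a hypothetical canary_split helper, computes the replica split for a target percentage and the share that actually results.

```python
def canary_split(total_replicas, canary_percent):
    """Return (stable replicas, canary replicas, effective canary percent)."""
    # Round up, and always run at least one canary Pod.
    canary = max(1, -(-total_replicas * canary_percent // 100))
    stable = total_replicas - canary
    return stable, canary, canary * 100 / total_replicas


print(canary_split(20, 5))  # (19, 1, 5.0)
print(canary_split(10, 5))  # (9, 1, 10.0)
```

With only 10 replicas, the smallest possible canary already receives about 10% of traffic, which is why request-level splits such as 5% usually rely on a service mesh or ingress controller rather than replica counts.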

KUBERNETES DEPLOYMENT

Frequently Asked Questions

A Kubernetes Deployment is a declarative configuration object that manages the desired state for a set of identical Pods running a containerized application, such as a model inference service. It works by continuously reconciling the actual state of the cluster with the state defined in its manifest file (typically YAML). The Deployment controller automatically handles Pod creation, scaling, and updates, and keeps a revision history for rollbacks. For a model serving use case, you define the container image (e.g., a Triton Inference Server), resource requests/limits (CPU, memory, GPUs), and the number of replicas. Kubernetes then works to keep the specified number of Pods running and healthy, replacing them if they fail.

Related Terms

A Kubernetes Deployment is a core controller for managing stateless applications. Understanding these related concepts is essential for building robust, scalable model serving infrastructure.

Pod

The smallest deployable unit in Kubernetes. A Pod represents a single instance of a running process (or group of tightly coupled processes) in your cluster.

Ephemeral by nature: Pods are created and destroyed to match the state specified by a Deployment.
Contains one or more containers: For ML serving, this is typically your model inference server (e.g., a container running Triton) and potentially a sidecar for logging or monitoring.
Basic unit of deployment: A Deployment's primary function is to create and manage identical Pods.

ReplicaSet

A ReplicaSet is a lower-level controller that ensures a specified number of identical Pod replicas are running at any given time. A Kubernetes Deployment manages updates by creating and managing ReplicaSets.

Scaling foundation: The replicas field in a Deployment manifest defines the desired Pod count, which the ReplicaSet enforces.
Pod lifecycle management: If a Pod fails or is deleted, the ReplicaSet creates a replacement to maintain the desired count.
Update mechanism: During a rolling update, the Deployment creates a new ReplicaSet with the updated Pod template and scales it up while scaling the old one down.
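
The hand-off between the old and new ReplicaSet can be sketched as a small simulation. It makes a strong simplifying assumption, that every new Pod becomes ready immediately, so it shows the ordering of scale-up and scale-down steps rather than real timing. The rolling_update generator is hypothetical, and maxSurge and maxUnavailable are passed as absolute numbers.

```python
def rolling_update(replicas, max_surge=1, max_unavailable=0):
    """Yield (old, new) ReplicaSet sizes; new Pods are assumed ready at once."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    old, new = replicas, 0
    yield old, new
    while old > 0 or new < replicas:
        # Scale up the new ReplicaSet within the surge budget.
        new += min(replicas - new, replicas + max_surge - (old + new))
        # Scale down the old ReplicaSet while keeping enough Pods available.
        old -= min(old, (old + new) - (replicas - max_unavailable))
        yield old, new


for old, new in rolling_update(3):
    print(f"old={old} new={new}")
```

With three replicas and the settings above, the simulation steps through (3, 0), (2, 1), (1, 2), (0, 3), so capacity never drops below three Pods.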

Service

A Kubernetes Service is an abstraction that defines a logical set of Pods and a policy to access them. It provides a stable network endpoint (IP address and DNS name) for your model inference Pods, which are ephemeral.

Load balancing: Distributes traffic across all healthy Pods matching the Service's selector.
Decouples network identity: Clients connect to the stable Service endpoint, not to individual Pod IPs that change.
Critical for serving: An inference API endpoint is typically exposed via a Service, often of type ClusterIP for internal access or LoadBalancer for external access.
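
Selector matching is simple enough to sketch directly. The snippet below is an illustrative model of how a Service picks its endpoints: a Pod is included when its labels contain every key/value pair in the selector and it is ready. The Pod records and the select_endpoints helper are hypothetical.

```python
def select_endpoints(selector, pods):
    """Return names of ready Pods whose labels include every selector pair."""
    return [
        pod["name"]
        for pod in pods
        if pod["ready"]
        and all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


pods = [
    {"name": "triton-a", "labels": {"app": "triton", "version": "v1"}, "ready": True},
    {"name": "triton-b", "labels": {"app": "triton", "version": "v2"}, "ready": False},
    {"name": "other-c", "labels": {"app": "other"}, "ready": True},
]

print(select_endpoints({"app": "triton"}, pods))  # ['triton-a']
```

Adding a version label to the selector narrows the set further, which is the mechanism a blue-green switch relies on.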

Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler automatically scales the number of Pods in a Deployment, ReplicaSet, or StatefulSet based on observed CPU utilization, memory consumption, or custom metrics.

Key for cost optimization: Scales your inference service up during peak demand and down during lulls.
Metric-driven: Can scale based on application-specific metrics like inference request queue length or model latency, exposed to the autoscaler through the custom or external metrics APIs by an adapter.
Works with Deployments: The HPA directly modifies the replicas field of the targeted Deployment.

Rolling Update

The default update strategy for a Kubernetes Deployment. A rolling update incrementally replaces old Pods with new ones, ensuring zero-downtime deployments of new model versions.

Controlled rollout: Updates Pods a few at a time (maxUnavailable, maxSurge), rather than all at once.
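Basis for canary and blue-green deployments: These strategies are not built into the Deployment itself. They are typically assembled from two Deployments whose Pod labels are targeted by Service selectors or a traffic router.
Stalled rollouts: If the new Pods fail their readiness checks, the rollout stops progressing and the old Pods keep serving. Returning to the previous stable version is a separate step, done with kubectl rollout undo or automated by higher-level tooling.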

ConfigMap & Secret

ConfigMaps and Secrets are Kubernetes objects used to inject configuration data and sensitive information into Pods, separating configuration from application code.

ConfigMap: Holds non-confidential configuration data (e.g., model configuration file for Triton, preprocessing parameters). Mounted as files or environment variables.
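Secret: Holds sensitive data like API keys, database passwords, or model registry credentials. Values are base64-encoded, not encrypted, by default, so encryption at rest and access control should be configured separately.
Essential for MLOps: Allows you to update model server settings or rotate credentials without rebuilding your container image. A Deployment's Pod template references these objects; values injected as environment variables are only picked up when the Pods restart.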
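
The difference between encoding and encryption is easy to demonstrate. The sketch below builds a Secret manifest as a plain dictionary and then decodes the value again with nothing but the standard library. The Secret name, key, and token value are hypothetical.

```python
import base64


def encode_secret_data(values):
    """Base64-encode string values for the data field of a Secret."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


# Hypothetical Secret holding a model registry token.
secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "model-registry-credentials"},
    "type": "Opaque",
    "data": encode_secret_data({"token": "example-token"}),
}

# Anyone who can read the object can reverse the encoding.
decoded = base64.b64decode(secret["data"]["token"]).decode("utf-8")
print(secret["data"]["token"], "->", decoded)
```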

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
