Inferensys

Glossary

Kubernetes Deployment

A Kubernetes Deployment is a declarative configuration object that manages the desired state for a set of identical pods running a containerized application, such as a model inference service, handling updates and scaling.
MODEL SERVING ARCHITECTURES

What is a Kubernetes Deployment?

A core Kubernetes object for declaratively managing the lifecycle of containerized applications, such as model inference services.

A Kubernetes Deployment is a declarative configuration object that manages the desired state for a set of identical Pods running a containerized application, such as a model inference service. It automates critical operational tasks including rolling updates, rollbacks, and scaling, ensuring the specified number of replica Pods are always running and healthy. This abstraction is fundamental for deploying stateless, scalable microservices.

For machine learning, a Deployment YAML file defines the container image (e.g., a Triton Inference Server), resource requests/limits, and environment variables. The Deployment controller continuously reconciles the actual state with this declared state, and Pods lost to node failures are replaced and rescheduled onto healthy nodes. Deployments are also the building block for canary and blue-green rollouts of new model versions, which pair two Deployments with Service-level or mesh-level traffic routing, and they integrate with the Horizontal Pod Autoscaler for demand-based scaling. Together these form the backbone of resilient model serving infrastructure.
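As a rough illustration of the manifest described above, the sketch below builds a Deployment as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The name triton-inference, the image reference, the MODEL_REPOSITORY variable, and all resource figures are placeholders chosen for this example, not values from a real deployment.

```python
import json


def build_deployment(name, image, replicas, env=None):
    """Return an apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "env": [{"name": k, "value": v} for k, v in (env or {}).items()],
        "resources": {
            # Placeholder sizing; tune for the actual model.
            "requests": {"cpu": "4", "memory": "16Gi"},
            "limits": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels in the Pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


manifest = build_deployment(
    name="triton-inference",                     # hypothetical name
    image="example.registry/triton-server:1.0",  # hypothetical image
    replicas=3,
    env={"MODEL_REPOSITORY": "/models"},         # hypothetical variable
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would submit the manifest to a cluster.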

Key Features of Kubernetes Deployments

A Kubernetes Deployment provides declarative updates and lifecycle management for a set of identical Pods running a containerized application, such as a model inference service. It is the primary controller for managing stateless applications in Kubernetes.

01

Declarative Desired State

A Deployment's core function is to declaratively manage a desired state for your application. You define the state in a YAML manifest (e.g., 3 replicas, container image my-model:v2), and the Deployment's controller continuously reconciles the actual cluster state to match. This automation is critical for maintaining the availability of inference endpoints.

  • Example: A manifest specifies replicas: 5 and image: inference-service:3.1.0. The controller ensures exactly five Pods are always running with that image.
  • Mechanism: The controller acts on differences between the .spec (desired) and .status (actual) fields, creating or deleting Pods via a ReplicaSet.
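The reconciliation idea can be sketched as a toy function that compares desired and observed state and returns the actions needed to close the gap. This is a deliberate simplification: the real controller works through ReplicaSets and replaces Pods gradually, and the reconcile function and its list-of-images input are inventions for this example.

```python
def reconcile(desired_replicas, desired_image, running_pods):
    """Return the actions that move the running Pods toward the desired state.

    running_pods is a list of image strings, one per running Pod.
    """
    actions = []
    # Pods running any other image are stale and must go.
    stale = [img for img in running_pods if img != desired_image]
    current = [img for img in running_pods if img == desired_image]
    actions += [("delete", img) for img in stale]

    # Create or delete Pods until the count on the desired image matches.
    diff = desired_replicas - len(current)
    if diff > 0:
        actions += [("create", desired_image)] * diff
    elif diff < 0:
        actions += [("delete", desired_image)] * (-diff)
    return actions


# Three Pods are running but five are desired: two must be created.
print(reconcile(5, "inference-service:3.1.0", ["inference-service:3.1.0"] * 3))
```

An empty result means the observed state already matches the declared one, which is the steady state the controller aims for.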
02

Rolling Updates and Rollbacks

Deployments enable zero-downtime updates through a rolling update strategy. When you update the Pod template (e.g., a new model version), the controller incrementally replaces old Pods with new ones, ensuring a subset of replicas remains available to serve traffic. This is essential for deploying new model versions without interrupting service.

  • Strategy Controls: You can configure maxUnavailable and maxSurge to control the pace of the rollout.
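  • Failed Rollouts: A Deployment does not roll back on its own. If new Pods fail their readiness probes, the rollout stalls with the old Pods still serving, and it is reported as failed once progressDeadlineSeconds elapses. You then revert with kubectl rollout undo, or let a higher-level tool automate that step.
  • Version History: Kubernetes retains a revision history (bounded by revisionHistoryLimit), allowing manual rollback to any retained Deployment revision.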
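To make the effect of maxSurge and maxUnavailable concrete, the sketch below computes the Pod-count bounds they impose during a rollout. Percentages are resolved against the replica count, rounding up for maxSurge and down for maxUnavailable, and both default to 25%. The function names are hypothetical.

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as '25%'."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the most Pods that may exist and the fewest that must be available."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
# {'max_total_pods': 13, 'min_available_pods': 8}
```

For GPU-bound inference services, setting maxSurge to 1 and maxUnavailable to 0 keeps full capacity during the rollout at the cost of needing one spare GPU.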
03

Pod Replication and Scaling

Deployments ensure high availability and scale by managing multiple identical Pod replicas. The replica count is a core spec field. Scaling is managed manually via kubectl scale or automatically by a Horizontal Pod Autoscaler (HPA) based on CPU/memory usage or custom metrics like inference request queue length.

  • High Availability: Distributing replicas across cluster nodes protects against node failures.
  • Load Distribution: A Kubernetes Service load-balances incoming inference requests across all healthy Pod replicas.
  • Example: An HPA policy can scale a model deployment from 3 to 10 replicas during peak inference demand, then scale back down to conserve resources.
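The HPA's core scaling rule is a simple ratio: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up and clamped to the configured bounds. The sketch below reproduces only that rule plus the default 10% tolerance band; the real autoscaler also handles stabilization windows, missing metrics, and Pods that are not yet ready. The queue-length figures are made up for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Apply the basic HPA ratio rule with a tolerance band and min/max clamp."""
    ratio = current_metric / target_metric
    # Small deviations from the target do not trigger scaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Three replicas each see 90 queued requests against a target of 30.
print(desired_replicas(3, 90, 30, min_replicas=3, max_replicas=10))  # 9
```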
04

Health Monitoring with Probes

Deployments rely on readiness and liveness probes to manage Pod lifecycle and ensure traffic is only sent to healthy endpoints. This is critical for model serving, where initialization (loading large models) can take significant time.

  • Readiness Probe: Determines if a Pod is ready to receive traffic. A model server Pod should only be marked ready after the model is fully loaded into GPU memory.
  • Liveness Probe: Determines if a container is running correctly. If it fails, the kubelet restarts the container. This can recover a model server that has hung or deadlocked.
  • Startup Probe: Holds off liveness and readiness checks until it succeeds, giving slow-starting containers, such as servers loading large models, time to initialize without being restarted prematurely.
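The three probes might be configured along the lines of the sketch below, which also computes how long the startup probe allows the container to initialize (periodSeconds multiplied by failureThreshold). The /healthz and /ready paths, port 8000, and all timing values are assumptions for illustration; the endpoints a given model server exposes will differ.

```python
def probe(path, port, period_seconds, failure_threshold):
    """Build an HTTP probe stanza for a container spec."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


probes = {
    # Allows a long model load before liveness checks begin.
    "startupProbe": probe("/healthz", 8000, period_seconds=10, failure_threshold=30),
    # Gates traffic: the Pod is removed from Service endpoints while this fails.
    "readinessProbe": probe("/ready", 8000, period_seconds=5, failure_threshold=3),
    # Restarts the container if the server stops responding.
    "livenessProbe": probe("/healthz", 8000, period_seconds=10, failure_threshold=3),
}


def startup_budget_seconds(startup_probe):
    """Longest time the container may take to start before it is restarted."""
    return startup_probe["periodSeconds"] * startup_probe["failureThreshold"]


print(startup_budget_seconds(probes["startupProbe"]))  # 300
```

The probes dictionary would be merged into the container entry of the Pod template.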
05

Resource Management and Limits

Deployments allow precise specification of resource requests and limits for CPU and memory, as well as extended resources such as GPUs. This is fundamental for cost control and performance stability in inference clusters, preventing a single greedy model from starving others.

  • Requests: The resources the scheduler reserves for the Pod when choosing a node (e.g., 4 CPUs, 16Gi memory). Extended resources such as nvidia.com/gpu are declared under limits, and any request given for them must equal the limit.
  • Limits: The maximum resources a container can use. A container that exceeds its memory limit is terminated (OOMKilled), while one that exceeds its CPU limit is throttled.
  • Quality of Service (QoS): Based on these settings, Kubernetes assigns Pods a QoS class (Guaranteed, Burstable, BestEffort), which determines eviction order when a node runs short of resources.
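The QoS classification rules can be approximated in a few lines. The sketch below considers only CPU and memory, treats a missing request as equal to the limit (as Kubernetes does), and compares quantities as plain strings, so "4" and "4000m" would wrongly count as different; a production check would need to parse quantities. The function names are hypothetical.

```python
COMPUTE = ("cpu", "memory")


def effective(container):
    """Return (requests, limits) for CPU and memory, defaulting requests to limits."""
    res = container.get("resources", {})
    limits = {k: v for k, v in res.get("limits", {}).items() if k in COMPUTE}
    requests = {k: v for k, v in res.get("requests", {}).items() if k in COMPUTE}
    for key, value in limits.items():
        requests.setdefault(key, value)
    return requests, limits


def qos_class(containers):
    """Classify a Pod from the resource settings of its containers."""
    pairs = [effective(c) for c in containers]
    # BestEffort: no container sets any CPU or memory request or limit.
    if all(not req and not lim for req, lim in pairs):
        return "BestEffort"
    # Guaranteed: every container sets both limits, with requests equal to them.
    guaranteed = all(
        set(lim) == set(COMPUTE) and req == lim for req, lim in pairs
    )
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"resources": {"limits": {"cpu": "4", "memory": "16Gi"}}}]))
# Guaranteed
```

Inference Pods that must not be evicted under memory pressure are usually sized so they land in the Guaranteed class.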
06

Integration with Ecosystem Tools

Deployments are not used in isolation; they are the foundational object integrated with higher-level Kubernetes ecosystem tools specifically designed for ML serving, providing advanced capabilities like canary deployments, inference graphs, and specialized hardware acceleration.

  • KServe/Knative Serving: Provides serverless, auto-scaling capabilities with request-level metrics and canary rollouts on top of Deployments.
  • Seldon Core: Uses Deployments to orchestrate complex inference graphs (pre-processing -> model A -> model B -> post-processing).
  • NVIDIA GPU Operator/Kubernetes Device Plugins: Manages GPU resource scheduling and drivers, allowing Deployments to request nvidia.com/gpu resources.
  • Service Meshes (Istio, Linkerd): Provide advanced traffic routing, mirroring, and security between Deployment Pods for sophisticated canary and A/B testing of models.

How a Kubernetes Deployment Works

A Kubernetes Deployment is a declarative API object that defines a desired state for a replicated application. You specify the container image, number of replicas (pods), and update strategy. The Deployment controller continuously observes the cluster, reconciling the actual state to match your declared intent by creating, scaling, or updating pods. This automation is foundational for model serving architectures, ensuring a specified number of inference server instances are always running.

For updates, the controller orchestrates a rolling update by incrementally replacing old pods with new ones, maintaining service availability throughout. Canary and blue-green rollouts build on the same primitives by running two Deployments side by side and shifting traffic between them. A Deployment also works with the Horizontal Pod Autoscaler, which adjusts the replica count based on metrics like CPU usage. This declarative, self-healing management is essential for MLOps, providing a robust, scalable foundation for deploying and maintaining model inference services in production.

KUBERNETES DEPLOYMENT PATTERNS

Deployment Update Strategies for Model Serving

A comparison of Kubernetes strategies for updating model inference services, balancing availability, rollback speed, and resource overhead.

| Strategy | Rolling Update | Recreate | Blue-Green | Canary |
| --- | --- | --- | --- | --- |
| Update Mechanism | Gradual pod replacement | Terminate all, then create new | Full parallel environment swap | Controlled traffic split to new version |
| Downtime | | | | |
| Rollback Speed | Medium (pods must cycle) | Fast (single update) | Instant (traffic switch) | Instant (traffic switch) |
| Resource Overhead | Low (temporary 2x pods) | None | High (2x full environment) | Medium (partial 2x pods) |
| Traffic Control Granularity | Pod-level | N/A | Environment-level | Request-level (e.g., 5%) |
| Testing in Production | | | | |
| Kubernetes Native Support | | | | |
| Typical Use Case | General stateless updates | Breaking changes, schema migrations | Zero-downtime major releases | Performance/accuracy validation |
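The strategies compared above map onto Kubernetes objects in different ways, sketched below. Rolling Update and Recreate are the two values of a Deployment's .spec.strategy.type. Blue-green and canary are patterns assembled from two Deployments plus a Service: one common approach switches the Service selector between versions, and another lets a single Service span both versions so the traffic split roughly follows the replica ratio. The version label and the function names are assumptions for this example.

```python
def strategy_spec(kind, max_surge="25%", max_unavailable="25%"):
    """Return the .spec.strategy stanza for the two built-in strategy types."""
    if kind == "Recreate":
        return {"type": "Recreate"}
    if kind == "RollingUpdate":
        return {
            "type": "RollingUpdate",
            "rollingUpdate": {
                "maxSurge": max_surge,
                "maxUnavailable": max_unavailable,
            },
        }
    raise ValueError(f"{kind} is not a built-in Deployment strategy")


def service_selector(app, active_color):
    """Blue-green: the Service selects only the Pods of the active Deployment."""
    return {"app": app, "version": active_color}


def canary_share(stable_replicas, canary_replicas):
    """Approximate traffic share when one Service spans both Deployments."""
    return canary_replicas / (stable_replicas + canary_replicas)


print(strategy_spec("Recreate"))             # {'type': 'Recreate'}
print(service_selector("model", "green"))    # {'app': 'model', 'version': 'green'}
print(canary_share(19, 1))                   # 0.05
```

Replica-ratio canaries are coarse, since a 5% split needs at least 20 Pods; request-level percentages generally require a service mesh or an ingress controller with weighted routing.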

KUBERNETES DEPLOYMENT

Frequently Asked Questions

These FAQs address the core mechanisms of a Kubernetes Deployment and its role in production machine learning.

A Kubernetes Deployment is a declarative configuration object that manages the desired state for a set of identical Pods running a containerized application, such as a model inference service. It works by continuously reconciling the actual state of the cluster with the state defined in its manifest file (typically YAML). The Deployment controller automatically handles Pod creation, scaling, and updates, and supports rollbacks to earlier revisions. For a model serving use case, you define the container image (e.g., a Triton Inference Server), resource requests/limits (CPU, memory, GPUs), and the number of replicas. Kubernetes then ensures that the specified number of Pods are always running and healthy, replacing them if they fail.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.