Inferensys

Seldon Core

Seldon Core is an open-source platform for deploying, managing, monitoring, and explaining machine learning models on Kubernetes, supporting complex inference graphs and advanced deployment strategies.
MODEL SERVING ARCHITECTURES

What is Seldon Core?

Seldon Core is an open-source, Kubernetes-native platform for deploying, scaling, and managing machine learning models in production.

Seldon Core is an open-source model serving platform designed to deploy, manage, and monitor machine learning models on Kubernetes. It packages trained models from frameworks like TensorFlow or PyTorch as scalable microservices with REST and gRPC APIs. The platform supports complex inference graphs, in which models are chained with pre- and post-processing transformers and business logic into a single deployable unit. This enables multi-step model pipelines and deployment strategies such as canary rollouts directly within the Kubernetes ecosystem.
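
As a rough illustration of what gets packaged into such a microservice, the sketch below shows a model class in the style expected by the Seldon Core 1.x Python wrapper, which looks for a `predict(X, features_names)` method. The class name `IrisClassifier`, the centroids, and the scoring rule are hypothetical stand-ins for a real trained model.

```python
class IrisClassifier:
    """Hypothetical model class in the style of the Seldon Core 1.x Python wrapper."""

    def __init__(self):
        # A real component would load trained weights here (e.g. from disk).
        # Fixed per-class centroids are used so this sketch runs on its own.
        self.class_names = ["setosa", "versicolor"]
        self.centroids = [[5.0, 3.4], [5.9, 2.8]]

    def predict(self, X, features_names=None):
        """Return one row of class scores (summing to 1) per input row."""
        results = []
        for row in X:
            # Squared Euclidean distance from the row to each class centroid.
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in self.centroids]
            # Inverse-distance scores, normalised so each row sums to 1.
            inverse = [1.0 / (d + 1e-9) for d in dists]
            total = sum(inverse)
            results.append([v / total for v in inverse])
        return results
```

In a deployed component the wrapper would typically pass `X` as a NumPy array decoded from the REST or gRPC request; plain lists are used here only to keep the sketch dependency-free.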

Seldon Core supports advanced deployment patterns, including A/B testing, multi-armed bandits, and shadow deployments, all managed through declarative Kubernetes custom resources. It provides built-in monitoring for metrics such as throughput and latency, and integrates with explainability toolkits. By handling much of the operational complexity, Seldon Core lets MLOps engineers focus on model performance rather than infrastructure while keeping inference services scalable and observable.

Key Features of Seldon Core

Seldon Core's main features fall into six areas, from inference graph composition to its Kubernetes-native architecture.

01

Complex Inference Graphs

Seldon Core orchestrates multi-step inference pipelines, which are declared in a SeldonDeployment resource. A deployment is not limited to a single model; it describes a directed acyclic graph (DAG) whose nodes can be:

  • Models: Trained ML models (TensorFlow, PyTorch, SKLearn, etc.).
  • Transformers: Pre- and post-processing components for feature engineering.
  • Routers: Logic for A/B testing, multi-armed bandits, or ensemble methods.
  • Combiners: Aggregators that merge outputs from multiple model branches.

This allows engineers to build production-grade pipelines that mirror real-world business logic, such as chaining a fraud detection model with a recommendation model.
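
The node types above can be sketched as a SeldonDeployment manifest built in Python and serialised to JSON, which `kubectl apply -f` accepts as well as YAML. The field layout follows the Seldon Core 1.x `machinelearning.seldon.io/v1` schema as far as I know it; the deployment name, node names, and image references are hypothetical.

```python
import json


def node(name, node_type, children=()):
    """Build one inference-graph node (type is e.g. MODEL or TRANSFORMER)."""
    return {"name": name, "type": node_type, "children": list(children)}


# A transformer that feeds a single model: scaler -> fraud model.
graph = node("feature-scaler", "TRANSFORMER", [node("fraud-model", "MODEL")])

deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "fraud-pipeline"},  # hypothetical name
    "spec": {
        "predictors": [{
            "name": "default",
            "replicas": 1,
            "graph": graph,
            # One container per graph node; names must match the node names.
            "componentSpecs": [{"spec": {"containers": [
                {"name": "feature-scaler", "image": "example.org/feature-scaler:0.1"},
                {"name": "fraud-model", "image": "example.org/fraud-model:0.1"},
            ]}}],
        }]
    },
}


def walk(n, depth=0):
    """Yield (depth, name, type) for every node, parents before children."""
    yield depth, n["name"], n["type"]
    for child in n["children"]:
        yield from walk(child, depth + 1)


manifest_json = json.dumps(deployment, indent=2)
```

Writing `manifest_json` to a file gives a manifest that could be applied to a cluster with the Seldon operator installed; `walk(graph)` is only a convenience for inspecting the graph order.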
02

Advanced Deployment Strategies

The platform provides rollout patterns directly on Kubernetes, so teams do not have to hand-write the underlying Deployment, Service, and routing manifests. Key strategies include:

  • Canary Rollouts: Safely deploy a new model version to a small percentage of traffic (e.g., 5%) to monitor performance before full promotion.
  • Shadow Deployment: Send a copy of live traffic to a new model for comparison without affecting user-facing predictions.
  • Multi-Armed Bandit Testing: Dynamically route traffic to the best-performing model variant based on real-time feedback, optimizing for business metrics.

These strategies are managed through a custom Kubernetes SeldonDeployment resource, providing a declarative API for MLOps.
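
A canary or shadow rollout can be sketched as the `predictors` list of a SeldonDeployment. This assumes the Seldon Core 1.x schema, in which each predictor carries a `traffic` percentage and a shadow predictor is flagged with `shadow: true`; the helper names and image references are hypothetical.

```python
def _predictor(name, image, traffic):
    """One predictor serving a single model container (names are illustrative)."""
    return {
        "name": name,
        "traffic": traffic,
        "graph": {"name": "classifier", "type": "MODEL", "children": []},
        "componentSpecs": [{"spec": {"containers": [
            {"name": "classifier", "image": image},
        ]}}],
    }


def canary_predictors(stable_image, canary_image, canary_percent):
    """Split live traffic between a stable and a canary model version."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 1 and 99")
    return [
        _predictor("main", stable_image, 100 - canary_percent),
        _predictor("canary", canary_image, canary_percent),
    ]


def shadow_predictor(shadow_image):
    """A predictor that receives mirrored traffic; its responses are not returned to clients."""
    predictor = _predictor("shadow", shadow_image, 100)
    predictor["shadow"] = True
    return predictor
```

For example, `canary_predictors("example.org/model:1.0", "example.org/model:1.1", 5)` yields a 95/5 split; promoting the canary then amounts to editing the percentages and re-applying the resource.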
03

Built-in Explainability & Outlier Detection

Seldon Core integrates directly with the Alibi and Alibi Detect open-source libraries to provide production-ready explainability and monitoring.

  • Explainers: Generate feature attribution explanations (e.g., SHAP, Anchor, Integrated Gradients) for individual predictions to debug model behavior.
  • Outlier Detectors: Identify anomalous inference requests that fall outside the model's training data distribution using methods like Mahalanobis distance or Isolation Forests.

These components are deployed alongside a model rather than inside it, so they provide explanations and alerts without modifying the core model code.
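
To make the outlier-detection idea concrete, here is a minimal standard-library sketch of a Mahalanobis-style detector. It is not the Alibi Detect API: it assumes a diagonal covariance matrix (features treated as uncorrelated), and the class name and threshold are hypothetical.

```python
import math
from statistics import mean, pvariance


class DiagonalMahalanobisDetector:
    """Flags rows that lie far from the training data, per-feature scaled."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.means = []
        self.variances = []

    def fit(self, rows):
        """Estimate per-feature mean and variance from reference data."""
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Guard against zero variance so the division in score() is safe.
        self.variances = [pvariance(col) or 1e-12 for col in columns]

    def score(self, row):
        """Mahalanobis distance under the diagonal-covariance assumption."""
        return math.sqrt(sum(
            (x - m) ** 2 / v
            for x, m, v in zip(row, self.means, self.variances)
        ))

    def is_outlier(self, row):
        return self.score(row) > self.threshold
```

A production detector would estimate the full covariance matrix and calibrate the threshold on held-out data; the diagonal version is used here only to keep the arithmetic visible.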
04

Rich Metrics & Observability

The platform automatically exposes a comprehensive set of metrics for monitoring model performance and infrastructure health.

  • Request-Level Metrics: Latency, throughput, and 4xx/5xx error rates for each model in a graph.
  • Custom Business Metrics: Users can emit arbitrary metrics (e.g., prediction value distributions) from within model code.
  • Integration with Prometheus & Grafana: All metrics are exposed in Prometheus format, enabling seamless integration with standard Kubernetes monitoring stacks.
  • Distributed Tracing: Supports distributed tracing (Jaeger in Seldon Core 1.x, OpenTelemetry in Seldon Core 2) for following requests as they flow through complex inference graphs and locating latency bottlenecks.
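
Custom business metrics can be sketched with the Seldon Core 1.x Python wrapper convention, in which a component may define a `metrics()` method returning a list of `{"type", "key", "value"}` dictionaries (types `COUNTER`, `GAUGE`, or `TIMER`). The class, metric keys, and scoring logic below are hypothetical.

```python
class ModelWithMetrics:
    """Hypothetical component that reports custom metrics next to predictions."""

    def __init__(self):
        self.last_mean_score = 0.0

    def predict(self, X, features_names=None):
        # Stand-in "model": the score of a row is the mean of its features.
        scores = [sum(row) / len(row) for row in X]
        self.last_mean_score = sum(scores) / len(scores)
        return [[s] for s in scores]

    def metrics(self):
        """Metrics describing the most recent predict() call."""
        return [
            # Incremented by 1 for every prediction request.
            {"type": "COUNTER", "key": "predict_calls_total", "value": 1},
            # Latest mean score, useful for spotting shifts in the output distribution.
            {"type": "GAUGE", "key": "mean_prediction_score", "value": self.last_mean_score},
        ]
```

The returned values would then appear alongside the built-in request metrics in the Prometheus scrape output, where they can be charted in Grafana.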
05

Protocol Flexibility & gRPC Optimization

Seldon Core serves models over multiple protocols to suit different client needs and optimize performance.

  • REST/gRPC Endpoints: Automatic generation of both RESTful JSON APIs and high-performance gRPC endpoints from the same model packaging.
  • gRPC for High Throughput: The gRPC interface uses Protocol Buffers, a binary encoding that typically carries less serialization overhead than JSON over REST, which helps for high-volume inference.
  • Seldon Prediction Protocol: A consistent internal wire format that standardizes payloads between components in an inference graph, regardless of the original model framework.
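
As an illustration of the REST side, the sketch below builds (but does not send) a prediction request in the Seldon protocol's JSON form, where tabular input travels under `data.ndarray` with optional column `names`. The URL layout assumes the usual Seldon Core 1.x ingress path; the host, namespace, and deployment name are hypothetical.

```python
import json
from urllib.request import Request


def build_prediction_request(base_url, namespace, deployment, rows, names):
    """Return a POST request carrying a Seldon-protocol JSON payload."""
    payload = {"data": {"names": names, "ndarray": rows}}
    # Assumed ingress path: /seldon/<namespace>/<deployment>/api/v1.0/predictions
    url = f"{base_url}/seldon/{namespace}/{deployment}/api/v1.0/predictions"
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; the same payload shape is what components inside the graph exchange, which is why a transformer does not need to know which framework produced its input.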
06

Kubernetes-Native Architecture

Seldon Core is designed as a set of Kubernetes operators and custom resource definitions (CRDs), making it a natural extension of the Kubernetes ecosystem.

  • SeldonDeployment CRD: The primary resource for defining inference graphs, leveraging Kubernetes for lifecycle management.
  • Operator Pattern: Controllers automatically reconcile the desired state, managing the creation of Kubernetes Deployments, Services, and Horizontal Pod Autoscalers.
  • Integration with Service Meshes: Works seamlessly with Istio or Linkerd for advanced traffic management, security (mTLS), and observability features at the service mesh layer.
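
The operator pattern mentioned above boils down to a reconcile loop that compares desired and observed state. The toy function below illustrates only that idea; it is not Seldon's controller code, and the resource names in the usage note are hypothetical.

```python
def reconcile(desired, actual):
    """Return the actions needed to move `actual` toward `desired`.

    Both arguments map a resource name to its spec (any comparable value).
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    # Anything running that is no longer declared gets removed.
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions
```

Given a desired state with `deployment/iris` at two replicas plus `service/iris`, and an observed state with `deployment/iris` at one replica plus a stale `hpa/old`, the function proposes an update, a create, and a delete. A real controller repeats this comparison whenever the custom resource or its child objects change.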

How Seldon Core Works

Seldon Core deploys models as Kubernetes custom resources, wrapping them in standardized inference servers: either prepackaged model servers for common frameworks or custom containers built with Seldon's language wrappers. These servers expose a consistent REST or gRPC API, abstracting the underlying model framework (e.g., TensorFlow, PyTorch). The operator runs the servers as pods and relies on the Kubernetes control plane for lifecycle management, networking, and load balancing. This provides a resilient, scalable foundation for both online and batch inference workloads.
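
For a model that fits a prepackaged server, the custom resource can be very small. The sketch below assumes the Seldon Core 1.x schema, where a graph node names a server through `implementation` (here `SKLEARN_SERVER`) and points at stored model artifacts through `modelUri`; the function name and the bucket path in the usage note are hypothetical.

```python
import json


def sklearn_deployment(name, model_uri, replicas=1):
    """Build a SeldonDeployment that serves a scikit-learn model artifact."""
    return {
        "apiVersion": "machinelearning.seldon.io/v1",
        "kind": "SeldonDeployment",
        "metadata": {"name": name},
        "spec": {
            "predictors": [{
                "name": "default",
                "replicas": replicas,
                "graph": {
                    "name": "classifier",
                    # Prepackaged server: no custom container image is needed.
                    "implementation": "SKLEARN_SERVER",
                    "modelUri": model_uri,
                },
            }]
        },
    }


def to_manifest(deployment):
    """Serialise to JSON, which kubectl accepts in place of YAML."""
    return json.dumps(deployment, indent=2)
```

Calling `to_manifest(sklearn_deployment("iris", "gs://example-bucket/sklearn/iris"))` produces a document that could be piped to `kubectl apply -f -` on a cluster where the Seldon operator is installed.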

Its core innovation is the SeldonDeployment specification, which allows users to define complex inference graphs as directed acyclic graphs (DAGs). These graphs can chain multiple models, preprocessing and postprocessing components, and routers into a single deployable unit. This enables advanced patterns like A/B testing, multi-armed bandits, and ensembles. The platform integrates with service meshes like Istio for fine-grained traffic management, which is what supports canary and blue-green deployments.
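
A multi-armed bandit router can be sketched as a component with two methods, following the Seldon Core 1.x Python wrapper convention as I understand it: `route` returns the index of the child branch to call, and `send_feedback` receives the reward for an earlier routing decision. The epsilon-greedy policy and the class name are illustrative choices, not Seldon's built-in router.

```python
import random


class EpsilonGreedyRouter:
    """Hypothetical router: mostly exploit the best branch, sometimes explore."""

    def __init__(self, n_branches=2, epsilon=0.1, seed=None):
        self.n_branches = n_branches
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = [0] * n_branches
        self.reward_sum = [0.0] * n_branches

    def _mean_reward(self, branch):
        if self.pulls[branch] == 0:
            return 0.0
        return self.reward_sum[branch] / self.pulls[branch]

    def route(self, features, feature_names):
        """Return the index of the child branch that should serve this request."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_branches)  # explore
        return max(range(self.n_branches), key=self._mean_reward)  # exploit

    def send_feedback(self, features, feature_names, reward, truth, routing=None):
        """Record the reward observed for the branch that handled a request."""
        if routing is None:
            return
        self.pulls[routing] += 1
        self.reward_sum[routing] += reward
```

With `epsilon=0.0` the router always picks the branch with the highest observed mean reward; raising `epsilon` trades some short-term performance for continued exploration of the other variants.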

FEATURE COMPARISON

Seldon Core vs. Other Model Serving Platforms

A technical comparison of core capabilities across leading open-source and commercial platforms for deploying machine learning models on Kubernetes.

| Feature / Capability | Seldon Core | KServe | NVIDIA Triton Inference Server | Proprietary Cloud Endpoints (e.g., SageMaker, Vertex AI) |
| --- | --- | --- | --- | --- |
| Core Architecture | Kubernetes-native operator for custom resources | Kubernetes-native, built on Knative/Istio | Standalone inference server (containerized) | Managed cloud service with proprietary APIs |
| Multi-Framework Support | Yes (via prepackaged servers or custom containers) | Yes (via pre-defined ServingRuntimes) | Yes (native support for TensorFlow, PyTorch, ONNX, etc.) | Yes (often with framework-specific containers) |
| Complex Inference Graphs/Pipelines | Yes (native composition via Seldon CRDs) | Yes (via InferenceGraph spec) | Yes (via Ensemble models) | Limited (often chaining via separate services) |
| Advanced Deployment Strategies | Yes (Canary, Shadow, A/B, Multi-Armed Bandit) | Yes (Canary, Blue-Green via Istio) | No (requires external orchestrator) | Yes (managed canary, A/B via cloud console) |
| Built-in Explainability (e.g., LIME, SHAP) | Yes (integrated Alibi explainers) | Partial (optional explainer component) | No (requires manual integration) | Proprietary (e.g., SageMaker Clarify, Vertex Explainable AI) |
| Built-in Outlier/Drift Detection | Yes (integrated Alibi Detect) | No | No | Limited (proprietary cloud monitoring) |
| Request/Response Logging & Metrics | Yes (automatic Prometheus/Grafana) | Yes (via Knative/Istio metrics) | Yes (Prometheus metrics endpoint) | Yes (cloud-native monitoring dashboards) |
| GPU Support & Multi-Model Batching | Yes (via underlying inference server) | Yes (via underlying inference server) | Yes (native dynamic batching, concurrent models) | Yes (managed batching, instance types) |
| Model Orchestration & Lifecycle | Kubernetes Operator (GitOps friendly) | Kubernetes Custom Resource Definitions | Model Repository polling, REST/gRPC management APIs | Fully managed (proprietary CLI/UI) |
| Vendor Lock-in Risk | Low (runs on any Kubernetes cluster) | Low (open source, originated in the Kubeflow project) | Low (open-source, but NVIDIA-optimized) | High (proprietary APIs, cloud-specific features) |
| Enterprise Auth & Security (mTLS, RBAC) | Yes (via Istio integration) | Yes (via Istio integration) | Partial (supports auth extensions) | Yes (cloud IAM, VPC, private endpoints) |
| Cost Model for Core Platform | $0 for Apache 2.0 releases (later releases use a Business Source License; check the terms) | $0 (open-source) | $0 (open-source) | Variable (per-hour endpoint + prediction costs) |

SELDON CORE

Frequently Asked Questions

These FAQs address Seldon Core's core mechanisms, architecture, and role in production ML systems.

What is Seldon Core and how does it work?

Seldon Core is an open-source, Kubernetes-native platform for deploying, scaling, and managing machine learning models as microservices. It works by packaging models, their dependencies, and any required pre- and post-processing code into a standardized inference server container. The Seldon operator then orchestrates these containers on Kubernetes according to the declared custom resources, handling lifecycle management, traffic routing, and integration with the broader cloud-native ecosystem (e.g., Istio for networking, Prometheus for metrics). The platform hides much of the complexity of Kubernetes, allowing data scientists to define complex inference graphs (sequential, ensemble, or branching pipelines of models) in a single YAML resource.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on turning complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.