Model Serving Architectures
Terms related to the software systems and patterns for deploying and scaling models in production. Intended audience: MLOps and DevOps engineers.
Model Serving
Model serving is the process of deploying a trained machine learning model into a production environment where it can receive input data, perform inference, and return predictions via a defined interface.
Inference Server
An inference server is a specialized software application or service designed to load machine learning models, manage computational resources, and execute inference requests at scale with low latency and high throughput.
Model Deployment
Model deployment is the phase of the machine learning lifecycle where a trained model is integrated into a live production environment, making its predictive capabilities available to end-users or other software systems.
API Endpoint
An API endpoint is a specific URL or network address exposed by a model serving system that accepts HTTP or gRPC requests containing input data and returns the model's predictions as a structured response.
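As a rough illustration of the request and response contract, the sketch below shows only the handler logic behind an endpoint, not the network layer. The `features` field, the `predict` stand-in, and the response shape are assumptions made for the example, not part of any particular serving framework.

```python
import json

def predict(features):
    # Hypothetical stand-in for a real model: sums the input features.
    return sum(features)

def handle_request(body: str) -> tuple[int, str]:
    """Parse a JSON request body and return (HTTP status code, JSON response)."""
    try:
        payload = json.loads(body)
        result = predict(payload["features"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, a missing field, or non-numeric input is a client error.
        return 400, json.dumps({"error": 'expected a JSON object with a "features" list'})
    return 200, json.dumps({"prediction": result})

status, response = handle_request('{"features": [1, 2, 3]}')
```

In a real deployment this function would sit behind an HTTP or gRPC server, which is deliberately left out here.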
Containerization
Containerization is the practice of packaging a model, its dependencies, runtime, and configuration into a standardized, isolated software unit called a container, ensuring consistent execution across different computing environments.
Kubernetes Deployment
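A Kubernetes Deployment is a declarative API object that describes the desired state for a set of identical pods running a containerized application, such as a model inference service. Its controller creates and replaces pods to match that state, performs rolling updates, and changes the replica count when it is adjusted manually or by an autoscaler.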
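To make "declarative" concrete, the sketch below builds a minimal Deployment manifest as a Python dictionary and prints it as JSON, a format `kubectl apply -f` also accepts. The application name, image path, and port are hypothetical placeholders.

```python
import json

def deployment_manifest(name, image, replicas=2, port=8080):
    """Build a minimal apps/v1 Deployment manifest for one container."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }

# Hypothetical model server image; replace with a real registry path.
manifest = deployment_manifest(
    "sentiment-model", "registry.example.com/sentiment-model:1.0.0", replicas=3
)
print(json.dumps(manifest, indent=2))
```

A production manifest would normally also declare resource requests and readiness probes, which are omitted to keep the sketch short.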
Service Mesh
A service mesh is a dedicated infrastructure layer for managing service-to-service communication within a microservices architecture, providing observability, security, and traffic control for distributed model inference services.
API Gateway
An API gateway is a reverse proxy that acts as a single entry point for client requests, routing them to appropriate backend services (like inference servers), while handling cross-cutting concerns like authentication, rate limiting, and logging.
Canary Deployment
Canary deployment is a release strategy where a new version of a model is initially deployed to a small subset of production traffic to validate its performance and stability before a full rollout.
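One common way to select the "small subset" is to hash a request or user identifier into a bucket, so the same caller consistently sees the same version. The sketch below assumes a hypothetical `choose_version` helper and a percentage-based split. In practice the split is usually configured in a gateway or service mesh rather than in application code.

```python
import hashlib

def choose_version(request_id: str, canary_percent: int) -> str:
    """Deterministically route roughly canary_percent of IDs to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"

# Example: estimate the share of traffic sent to the canary at a 10% setting.
ids = [f"req-{i}" for i in range(10_000)]
canary_share = sum(choose_version(r, 10) == "canary" for r in ids) / len(ids)
```

Hashing rather than random sampling keeps routing sticky per identifier, which makes it easier to compare metrics between the two versions.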
Blue-Green Deployment
Blue-green deployment is a release strategy that maintains two identical production environments (blue and green). Traffic is switched in a single step from the old (stable) version of a model to the new one, which keeps downtime minimal and leaves the old environment available for a fast rollback.
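The switch can be pictured as flipping a single pointer between two fully provisioned environments. The `BlueGreenRouter` class below is a hypothetical in-process sketch. Real systems typically flip a load balancer target, DNS record, or Kubernetes Service selector instead.

```python
class BlueGreenRouter:
    """Routes every request to whichever environment is currently live."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"

    def switch(self):
        # A single assignment moves all traffic; calling it again rolls back.
        self.live = "green" if self.live == "blue" else "blue"

    def predict(self, x):
        return self.environments[self.live](x)

# Hypothetical model versions represented as plain functions.
router = BlueGreenRouter(blue=lambda x: ("v1", x), green=lambda x: ("v2", x))
before = router.predict(42)
router.switch()
after = router.predict(42)
```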
Model Versioning
Model versioning is the practice of assigning unique identifiers to different iterations of a machine learning model, enabling tracking, rollback, and simultaneous serving of multiple model variants.
Multi-Tenancy
Multi-tenancy in model serving is an architectural pattern where a single inference server or cluster simultaneously hosts and isolates multiple distinct models or clients, optimizing resource utilization.
Model Parallelism
Model parallelism is a distributed computing technique that partitions a single large machine learning model across multiple devices (e.g., GPUs) to overcome memory limitations, with each device executing a different portion of the model graph.
Pipeline Parallelism
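Pipeline parallelism is a form of model parallelism where consecutive groups of layers of a neural network are placed on different devices, forming a processing pipeline. Splitting each batch into micro-batches lets the stages work on different micro-batches at the same time, which increases throughput for batch inference.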
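The overlap between stages is easiest to see in a schedule. The sketch below is a simplified simulation that assumes every stage takes one time step per micro-batch. It lists which (stage, micro-batch) pairs run concurrently. The function name is hypothetical.

```python
def pipeline_schedule(num_stages: int, num_microbatches: int):
    """Return, for each time step, the (stage, microbatch) pairs that run together."""
    steps = []
    # With S stages and M micro-batches the pipeline drains after S + M - 1 steps.
    for t in range(num_stages + num_microbatches - 1):
        active = [
            (stage, t - stage)
            for stage in range(num_stages)
            if 0 <= t - stage < num_microbatches
        ]
        steps.append(active)
    return steps

schedule = pipeline_schedule(num_stages=3, num_microbatches=4)
for t, active in enumerate(schedule):
    print(t, active)
```

Under these assumptions, 3 stages and 4 micro-batches finish in 6 steps, compared with 12 steps if each micro-batch had to clear all stages before the next one started.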
Cold Start
Cold start is the initial latency incurred when a model inference service must load a model from persistent storage into memory and initialize its runtime environment before it can serve the first request.
Model Caching
Model caching is the technique of keeping a loaded machine learning model resident in memory (RAM or GPU memory) to eliminate the overhead of repeated disk I/O and initialization for subsequent inference requests.
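A minimal sketch of this idea is a least-recently-used (LRU) cache keyed by model name. The `loader` callable and the capacity limit are assumptions for illustration. A real server would budget by memory size rather than by model count.

```python
from collections import OrderedDict

class ModelCache:
    """Keeps up to `capacity` loaded models in memory, evicting the least recently used."""

    def __init__(self, loader, capacity=2):
        self.loader = loader          # callable that loads a model by name (the slow path)
        self.capacity = capacity
        self.loads = 0                # number of times the slow path was taken
        self._models = OrderedDict()

    def get(self, name):
        if name in self._models:
            # Cache hit: mark as most recently used and skip loading.
            self._models.move_to_end(name)
            return self._models[name]
        model = self.loader(name)
        self.loads += 1
        self._models[name] = model
        if len(self._models) > self.capacity:
            # Evict the least recently used model to stay within capacity.
            self._models.popitem(last=False)
        return model

# Hypothetical loader that stands in for reading weights from disk.
cache = ModelCache(loader=lambda name: f"model:{name}", capacity=2)
cache.get("a")
cache.get("a")  # served from memory, no second load
```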
Serverless Inference
Serverless inference is a cloud computing execution model where a model is deployed as a stateless function that automatically scales from zero based on incoming request events, with the cloud provider managing the underlying infrastructure.
Sidecar Pattern
The sidecar pattern is a microservices design pattern where a helper container (the sidecar) is deployed alongside a primary application container (e.g., a model server) to provide auxiliary functions like logging, monitoring, or proxying.
Model Pipeline
A model pipeline is a sequence of interconnected processing stages, which may include data preprocessing, inference across one or more models, and postprocessing, orchestrated to produce a final prediction or decision.
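In its simplest form, a pipeline is function composition: each stage consumes the previous stage's output. The stages below (tokenizing text, a toy scoring function, and a threshold) are hypothetical placeholders chosen only to show the flow.

```python
from functools import reduce

def make_pipeline(*stages):
    """Compose stages left to right into a single callable."""
    def run(x):
        return reduce(lambda value, stage: stage(value), stages, x)
    return run

def preprocess(text):
    return text.strip().lower().split()

def model(tokens):
    # Hypothetical model: scores an input by its token count.
    return len(tokens) / 10

def postprocess(score):
    return {"label": "long" if score >= 0.5 else "short", "score": score}

pipeline = make_pipeline(preprocess, model, postprocess)
result = pipeline("  Hello World ")
```

Serving platforms that support inference graphs generalize this linear chain to branches and ensembles.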
Multi-Model Serving
Multi-model serving is the capability of an inference server or platform to load, manage, and execute predictions for multiple different machine learning models concurrently within the same runtime environment.
Model Monitoring
Model monitoring is the continuous observation of a deployed model's performance, behavior, and operational health in production, tracking metrics like prediction accuracy, latency, throughput, and data drift.
Model Drift Detection
Model drift detection is the process of identifying and alerting when the statistical properties of the live input data diverge from the data the model was trained on, or when the model's predictive performance degrades over time.
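One widely used statistic for input drift on a single numeric feature is the Population Stability Index (PSI), which compares binned frequencies of training and live data. The sketch below is a simplified version that assumes equal-width bins derived from the training sample. The bin count and the epsilon floor are illustrative choices.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the baseline range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A common rule of thumb treats PSI values above roughly 0.2 as a meaningful shift, but suitable thresholds depend on the feature and the application.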
Triton Inference Server
Triton Inference Server (formerly TensorRT Inference Server) is open-source, multi-framework serving software from NVIDIA for deploying AI models at scale on both GPU and CPU. It supports models from frameworks and formats such as TensorFlow, PyTorch, and ONNX.
KServe
KServe (formerly KFServing) is an open-source model serving platform for Kubernetes. It exposes models through an InferenceService custom resource, giving teams a consistent interface for deploying and serving machine learning models, with capabilities such as canary rollouts and autoscaling.
Seldon Core
Seldon Core is an open-source platform for deploying, managing, monitoring, and explaining machine learning models on Kubernetes, supporting complex inference graphs and advanced deployment strategies.
Model Registry
A model registry is a centralized repository for storing, versioning, and managing metadata for trained machine learning models, facilitating collaboration and governance throughout the model lifecycle.
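The core bookkeeping can be sketched as an in-memory mapping from model name to an ordered list of version records. The `ModelRegistry` class, its field names, and the storage URIs below are hypothetical. Production registries persist this metadata and add stages, approvals, and access control.

```python
class ModelRegistry:
    """Tracks versioned metadata for models; artifacts themselves live elsewhere."""

    def __init__(self):
        self._versions = {}  # model name -> list of version records

    def register(self, name, artifact_uri, metrics=None):
        versions = self._versions.setdefault(name, [])
        record = {
            "version": len(versions) + 1,  # versions are 1-based and increasing
            "artifact_uri": artifact_uri,
            "metrics": metrics or {},
        }
        versions.append(record)
        return record["version"]

    def get(self, name, version=None):
        versions = self._versions[name]
        if version is None:
            return versions[-1]  # latest version by default
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]

registry = ModelRegistry()
registry.register("churn", "s3://example-bucket/churn/1", {"auc": 0.81})
registry.register("churn", "s3://example-bucket/churn/2", {"auc": 0.84})
latest = registry.get("churn")
rollback_target = registry.get("churn", version=1)
```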
Online Inference
Online inference (or real-time inference) is a model serving pattern where predictions are generated synchronously and returned with low latency in response to individual, live user requests.
Batch Inference
Batch inference is a model serving pattern where predictions are generated asynchronously for large volumes of pre-collected input data, prioritizing throughput over low-latency response for individual requests.
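A minimal sketch of the pattern is to stream pre-collected records through the model in fixed-size chunks. The `model` callable is assumed to accept a list of inputs and return one prediction per input. The batch size is an illustrative default.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

def batch_predict(model, records, batch_size=3):
    """Run the model over all records, one chunk at a time."""
    results = []
    for batch in batched(records, batch_size):
        # One model call per chunk amortizes per-call overhead across many inputs.
        results.extend(model(batch))
    return results

# Hypothetical model that doubles each input.
predictions = batch_predict(lambda batch: [x * 2 for x in batch], range(7))
```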
Load Balancer
A load balancer is a network device or software component that distributes incoming inference requests across multiple backend servers or pods to optimize resource use, maximize throughput, and ensure high availability.
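Round-robin is one of the simplest distribution policies: each request goes to the next backend in a fixed rotation. The backend addresses below are hypothetical. Real load balancers also track health and in-flight load, which this sketch ignores.

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backends in order, one per request."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"])
assigned = [balancer.next_backend() for _ in range(5)]
```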
Auto-Scaling
Auto-scaling is the capability of a cloud or container orchestration platform to automatically adjust the number of compute instances or pods running a model service based on real-time demand metrics like CPU utilization or request rate.
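As a worked example, the Kubernetes Horizontal Pod Autoscaler computes its target as ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies that ratio with minimum and maximum bounds. It leaves out the tolerance band and stabilization windows that a real autoscaler applies, and the default bounds are assumptions.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to how far the metric is from its target."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))

# 4 pods averaging 90% CPU against a 60% target should scale out.
scale_out = desired_replicas(4, current_metric=90, target_metric=60)
# 4 pods averaging 30% CPU against a 60% target can scale in.
scale_in = desired_replicas(4, current_metric=30, target_metric=60)
```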