Inferensys

Glossary

Model Serving Architectures

Terms for the software systems and patterns used to deploy and scale models in production. Intended audience: MLOps and DevOps engineers.

Model Serving

Model serving is the process of deploying a trained machine learning model into a production environment where it can receive input data, perform inference, and return predictions via a defined interface.

Inference Server

An inference server is a specialized software application or service designed to load machine learning models, manage computational resources, and execute inference requests at scale with low latency and high throughput.

Model Deployment

Model deployment is the phase of the machine learning lifecycle where a trained model is integrated into a live production environment, making its predictive capabilities available to end-users or other software systems.

API Endpoint

An API endpoint is a specific URL or network address exposed by a model serving system that accepts HTTP or gRPC requests containing input data and returns the model's predictions as a structured response.
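
As a minimal sketch of the request/response contract behind such an endpoint, the handler below parses a JSON body and returns a status code with a JSON response. The field names `instances` and `predictions`, the `handle_predict` function, and the toy `predict` model are assumptions made for illustration, not the API of any particular serving system.

```python
import json

def predict(features):
    # Hypothetical stand-in for a real model: returns the mean of the features.
    return sum(features) / len(features)

def handle_predict(body: bytes):
    """Parse a JSON request body and return (status_code, JSON response bytes)."""
    try:
        payload = json.loads(body)
        rows = payload["instances"]          # assumed request field name
        predictions = [predict(row) for row in rows]
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        # Malformed JSON, a missing field, or an unusable row is a client error.
        return 400, json.dumps({"error": "invalid request"}).encode()
    return 200, json.dumps({"predictions": predictions}).encode()
```

In a real deployment a web framework or inference server would call a handler like this for each HTTP or gRPC request.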

Containerization

Containerization is the practice of packaging a model, its dependencies, runtime, and configuration into a standardized, isolated software unit called a container, ensuring consistent execution across different computing environments.

Kubernetes Deployment

A Kubernetes Deployment is a declarative configuration object that manages the desired state for a set of identical pods running a containerized application, such as a model inference service, handling updates and scaling.

Service Mesh

A service mesh is a dedicated infrastructure layer for managing service-to-service communication within a microservices architecture, providing observability, security, and traffic control for distributed model inference services.

API Gateway

An API gateway is a reverse proxy that acts as a single entry point for client requests, routing them to the appropriate backend services (such as inference servers) while handling cross-cutting concerns such as authentication, rate limiting, and logging.

Canary Deployment

Canary deployment is a release strategy where a new version of a model is initially deployed to a small subset of production traffic to validate its performance and stability before a full rollout.
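
One way to picture the traffic split is a router that sends a fixed percentage of requests to the new version. The sketch below assumes each request carries a string ID and hashes it so that the same ID always lands on the same version; the `route` function and the ID format are hypothetical.

```python
import hashlib

def route(request_id: str, canary_percent: int) -> str:
    """Return 'canary' for roughly canary_percent% of request IDs, else 'stable'."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"
```

Raising `canary_percent` step by step toward 100 completes the rollout; setting it back to 0 returns all traffic to the stable version.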

Blue-Green Deployment

Blue-green deployment is a release strategy that maintains two identical production environments (blue and green), so that traffic can be switched from the old (stable) version of a model to the new one in a single step, with little or no downtime, and switched back just as quickly if problems appear.
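
The switch itself can be thought of as flipping a single pointer. The sketch below assumes each environment is represented by a callable model; the `BlueGreenRouter` class is hypothetical and leaves out health checks and connection draining.

```python
class BlueGreenRouter:
    """Send every request to whichever environment is currently active."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.active = "blue"  # blue starts as the live environment

    def switch(self):
        # Flip all traffic to the other environment in one step.
        self.active = "green" if self.active == "blue" else "blue"

    def predict(self, x):
        return self.environments[self.active](x)
```

Calling `switch()` a second time acts as the rollback.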

Model Versioning

Model versioning is the practice of assigning unique identifiers to different iterations of a machine learning model, enabling tracking, rollback, and simultaneous serving of multiple model variants.
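
The three uses named above (tracking, rollback, and serving several variants at once) can be sketched with a small in-memory store. The `VersionedModels` class and its integer version numbers are assumptions for illustration; real systems usually back this with a model registry.

```python
class VersionedModels:
    """Keep several versions of one model and serve a default or a pinned one."""

    def __init__(self):
        self._versions = {}   # version number -> model callable
        self.default = None   # version served when the caller does not pin one

    def register(self, version: int, model, make_default: bool = True):
        self._versions[version] = model
        if make_default or self.default is None:
            self.default = version

    def rollback(self, version: int):
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self.default = version

    def predict(self, x, version=None):
        chosen = self.default if version is None else version
        return self._versions[chosen](x)
```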

Multi-Tenancy

Multi-tenancy in model serving is an architectural pattern where a single inference server or cluster simultaneously hosts and isolates multiple distinct models or clients, optimizing resource utilization.

Model Parallelism

Model parallelism is a distributed computing technique that partitions a single large machine learning model across multiple devices (e.g., GPUs) to overcome memory limitations, with each device executing a different portion of the model graph.

Pipeline Parallelism

Pipeline parallelism is a form of model parallelism in which consecutive groups of layers of a neural network are placed on different devices, so that each device acts as one stage of a processing pipeline; splitting a batch into micro-batches keeps the stages busy at the same time and increases throughput.

Cold Start

Cold start is the initial latency incurred when a model inference service must load a model from persistent storage into memory and initialize its runtime environment before it can serve the first request.

Model Caching

Model caching is the technique of keeping a loaded machine learning model resident in memory (RAM or GPU memory) so that subsequent inference requests avoid the overhead of repeated disk I/O and initialization.
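
A minimal sketch of the idea, using the standard library's `functools.lru_cache` to keep a bounded number of models resident. The `get_model` loader is hypothetical; a real loader would read weights from disk or object storage, which is the slow step behind a cold start.

```python
from functools import lru_cache

@lru_cache(maxsize=2)  # keep at most two models resident; evict the least recently used
def get_model(name: str):
    # Hypothetical loader standing in for an expensive read from storage.
    weights = {"name": name}
    return lambda x: (weights["name"], x)
```

The first call for a given name pays the load cost; later calls return the resident model until it is evicted.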

Serverless Inference

Serverless inference is a cloud computing execution model where a model is deployed as a stateless function that automatically scales from zero based on incoming request events, with the cloud provider managing the underlying infrastructure.

Sidecar Pattern

The sidecar pattern is a microservices design pattern where a helper container (the sidecar) is deployed alongside a primary application container (e.g., a model server) to provide auxiliary functions like logging, monitoring, or proxying.

Model Pipeline

A model pipeline is a sequence of interconnected processing stages, which may include data preprocessing, inference across one or more models, and postprocessing, orchestrated to produce a final prediction or decision.
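
The staging can be sketched as a list of functions applied in order. The stage functions below, including the toy word-list "model", are hypothetical and only illustrate the preprocess, infer, postprocess shape.

```python
def preprocess(text: str) -> list:
    # Normalize case and split into tokens.
    return text.lower().split()

def infer(tokens: list) -> float:
    # Hypothetical scoring model: fraction of tokens found in a positive-word list.
    positive = {"good", "great", "fast"}
    return sum(t in positive for t in tokens) / max(len(tokens), 1)

def postprocess(score: float) -> str:
    # Turn the raw score into a final decision.
    return "positive" if score >= 0.5 else "negative"

def run_pipeline(x, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        x = stage(x)
    return x
```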

Multi-Model Serving

Multi-model serving is the capability of an inference server or platform to load, manage, and execute predictions for multiple different machine learning models concurrently within the same runtime environment.

Model Monitoring

Model monitoring is the continuous observation of a deployed model's performance, behavior, and operational health in production, tracking metrics like prediction accuracy, latency, throughput, and data drift.

Model Drift Detection

Model drift detection is the process of identifying and alerting when the statistical properties of the live input data diverge from those of the data the model was trained on (data drift), or when the model's predictive performance degrades over time.
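
As a deliberately simple illustration of the input-data side, the check below measures how far the mean of a live feature has moved from the training mean, in units of the training standard deviation. This is a heuristic sketch rather than a full statistical test, and the function names and the 0.5 threshold are assumptions.

```python
import statistics

def mean_shift_score(train, live):
    """Absolute shift of the live mean, measured in training standard deviations."""
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train)
    if sigma == 0:
        # A constant training feature: any change at all counts as unbounded shift.
        return 0.0 if statistics.fmean(live) == mu else float("inf")
    return abs(statistics.fmean(live) - mu) / sigma

def has_drifted(train, live, threshold=0.5):
    return mean_shift_score(train, live) > threshold
```

A production system would typically run a check of this kind per feature over a sliding window and raise an alert when the threshold is crossed.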

Triton Inference Server

Triton Inference Server (formerly TensorRT Inference Server) is open-source, multi-framework serving software from NVIDIA, optimized for deploying AI models from frameworks and formats such as TensorFlow, PyTorch, and ONNX at scale on both GPU and CPU.

KServe

KServe (formerly KFServing) is an open-source model serving platform for Kubernetes that provides a standardized, scalable interface for deploying and serving machine learning models, with capabilities such as canary rollouts and autoscaling.

Seldon Core

Seldon Core is an open-source platform for deploying, managing, monitoring, and explaining machine learning models on Kubernetes, supporting complex inference graphs and advanced deployment strategies.

Model Registry

A model registry is a centralized repository for storing, versioning, and managing metadata for trained machine learning models, facilitating collaboration and governance throughout the model lifecycle.

Online Inference

Online inference (or real-time inference) is a model serving pattern where predictions are generated synchronously and returned with low latency in response to individual, live user requests.

Batch Inference

Batch inference is a model serving pattern where predictions are generated asynchronously for large volumes of pre-collected input data, prioritizing throughput over low-latency response for individual requests.
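
A minimal sketch of the pattern: the input is read in fixed-size chunks and the model is called once per chunk rather than once per record. The `batched` and `batch_predict` helpers are hypothetical, and the model is assumed to accept a list of inputs and return a list of predictions.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of up to batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def batch_predict(model, items, batch_size=3):
    results = []
    for batch in batched(items, batch_size):
        results.extend(model(batch))  # one model call per batch
    return results
```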

Load Balancer

A load balancer is a network device or software component that distributes incoming inference requests across multiple backend servers or pods to optimize resource use, maximize throughput, and ensure high availability.
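
Round-robin is one of the simplest distribution policies and is shown here only as an example; real load balancers also offer policies such as least-connections and perform health checks. The `RoundRobinBalancer` class and the backend names are hypothetical.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._next = cycle(backends)

    def pick(self):
        return next(self._next)
```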

Auto-Scaling

Auto-scaling is the capability of a cloud or container orchestration platform to automatically adjust the number of compute instances or pods running a model service based on real-time demand metrics like CPU utilization or request rate.
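
The core of a metric-based scaling decision can be sketched with the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function below is a simplified illustration; the `desired_replicas` name and the default bounds are assumptions, and the real autoscaler also applies a tolerance band and stabilization windows.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to how far the metric is from target."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds so the service never scales past its limits.
    return max(min_replicas, min(max_replicas, raw))
```

For example, two pods averaging 90% CPU against a 60% target would be scaled to three.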