Connection draining is a resilience pattern that gracefully removes a compute instance from a load balancer's active pool by allowing existing, in-flight requests to complete while refusing all new connections. This process, called deregistration delay on some platforms, is a core component of graceful shutdown procedures in microservices and cloud-native architectures. It prevents abrupt connection termination, which can cause user-facing errors and data corruption, by ensuring active sessions finish their work.
Connection Draining

What is Connection Draining?
A critical resilience pattern for graceful instance termination in distributed systems.
The pattern is implemented by signaling the load balancer to stop sending new traffic to a target instance while a deregistration delay timer counts down. During this period, the instance processes its remaining in-flight requests before finally terminating. This is essential for zero-downtime deployments, auto-scaling events, and chaos engineering tests, as it maintains system stability and user experience during infrastructure changes. It works in concert with health checks and the circuit breaker pattern to build fault-tolerant systems.
Key Characteristics of Connection Draining
Connection draining is a critical resilience pattern for gracefully removing service instances. It ensures in-flight requests complete while preventing new connections, enabling zero-downtime deployments and failover.
Graceful Shutdown Mechanism
Connection draining is the controlled process of removing a server instance from a load balancer's active pool. The core mechanism involves two simultaneous actions:
- Refusing new connections: The load balancer stops routing new client requests to the instance.
- Completing in-flight requests: The instance continues processing and responding to all existing, established connections until they naturally terminate or a timeout is reached.
This prevents cascading failures that can occur when instances are terminated mid-request, which could corrupt client state or cause user-facing errors.
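The two actions above can be sketched from the instance's side as a small gate that admits requests until draining starts and then waits for in-flight work. This is a minimal illustration, not any load balancer's actual implementation; the `DrainGate` class and its method names are hypothetical.

```python
import threading


class DrainGate:
    """Tracks in-flight requests and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Admit a request unless draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one admitted request as finished."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout_s: float) -> bool:
        """Stop admitting requests, then wait for in-flight work.

        Returns True if all requests finished, False if the timeout expired.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout_s
            )
```

A request handler would call `try_begin()` on entry and `end()` in a `finally` block, while the shutdown path calls `drain()` and terminates the process once it returns.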
Configurable Draining Timeout
A draining timeout is a mandatory configuration parameter that defines the maximum duration the process is allowed to take. This acts as a safety mechanism to prevent instances from hanging indefinitely.
- Typical Settings: Timeouts commonly range from 1 to 300 seconds (5 minutes), depending on the application's maximum expected request duration.
- Timeout Behavior: When the timeout expires, any remaining connections are forcibly terminated. This ensures the deployment or scaling event proceeds, trading perfect grace for operational progress.
- Setting Strategy: The timeout should be set slightly higher than the 99th percentile (P99) of your application's request latency to cover nearly all normal operations.
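The setting strategy above can be turned into a rough calculation. The sketch below computes a nearest-rank P99 from observed latencies and adds headroom; the 20% margin and the 1 to 3600 second clamp are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def p99(samples_s: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not samples_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_s)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggest_drain_timeout(
    samples_s: list[float],
    margin: float = 1.2,     # assumed headroom above P99
    floor_s: int = 1,        # assumed lower bound
    ceiling_s: int = 3600,   # assumed upper bound
) -> int:
    """Round P99 plus margin up to whole seconds, clamped to the bounds."""
    raw = math.ceil(p99(samples_s) * margin)
    return max(floor_s, min(ceiling_s, raw))
```

Requests slower than the P99 will still be cut off at the timeout, so workloads with rare but important long requests (large uploads, for example) may warrant a higher percentile.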
Integration with Health Checks
Connection draining works in tandem with application health checks to provide a coherent shutdown signal.
- Draining State vs. Unhealthy State: When an instance enters a draining state, it typically continues to respond to health check probes as 'healthy'. This is distinct from marking an instance as 'unhealthy,' which would cause immediate, forceful ejection.
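- Orchestrator Coordination: In platforms like Kubernetes, deleting a pod changes its status to 'Terminating', after which two things happen in parallel: 1) The pod is removed from the Service's endpoints, so kube-proxy and ingress controllers stop sending new traffic. 2) The kubelet runs any pre-stop hook and then sends SIGTERM to the container. The application then begins its graceful shutdown, using the remaining grace period to drain connections. Because the two steps are not ordered relative to each other, a pod can briefly receive new requests after its shutdown has begun.
- Pre-stop Hooks: Many systems use a pre-stop lifecycle hook to run custom cleanup logic, or simply to wait for traffic to stop arriving, before the container receives SIGTERM. If the container is still running when the termination grace period expires, the container runtime sends the final SIGKILL signal.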
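On the application side, the shutdown signal is usually handled by flipping a flag and letting the serving loop drain. The sketch below assumes a Kubernetes-style sequence in which the grace period also covers the time spent in the pre-stop hook; the function names and the 2-second safety margin are hypothetical.

```python
import signal
import threading

# Set when the process is asked to shut down; the serving loop polls this.
shutdown_requested = threading.Event()


def on_sigterm(signum, frame) -> None:
    """Signal handler: only flip a flag and leave real work to the main loop."""
    shutdown_requested.set()


def install_handler() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, on_sigterm)


def drain_budget_s(
    grace_period_s: float, pre_stop_s: float, margin_s: float = 2.0
) -> float:
    """Seconds left for draining after the pre-stop hook, minus a margin."""
    return max(0.0, grace_period_s - pre_stop_s - margin_s)
```

A budget of zero means the pre-stop hook alone consumes the grace period, so in-flight requests would be cut off by SIGKILL.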
Prevention of Cascading Failures
The primary resilience objective of connection draining is to prevent cascading failures during deployments, scaling-in, or instance failure recovery.
- Context in Circuit Breakers: In a multi-agent or microservices architecture, abruptly terminating an instance can cause upstream callers to receive TCP connection resets or HTTP 5xx errors. These failures can propagate back through the call chain.
- Controlled Failure Domain: By draining, you contain the failure domain to a single instance. Upstream services using retry logic with exponential backoff can seamlessly retry failed requests on other healthy instances, often without the end-user noticing.
- Contrast with Fail-Fast: This is a complementary pattern to Fail-Fast. While fail-fast immediately rejects calls to a known-bad dependency, draining ensures the provider of a service doesn't become the cause of failures for its consumers during controlled shutdowns.
Use in Deployment Strategies
Connection draining is a foundational enabler for advanced, zero-downtime deployment strategies.
- Blue-Green Deployments: As traffic is switched from the 'blue' (old) environment to the 'green' (new) environment, the blue instances are drained of connections before being decommissioned.
- Canary Releases: When a canary instance (running a new version) is determined to be unhealthy, it is drained and removed without affecting traffic to the stable baseline version.
- Rolling Updates: In Kubernetes, a rolling update incrementally replaces pods. Old pods are drained and terminated as new pods become ready, with the maxSurge and maxUnavailable settings bounding how far the replica count may deviate from the desired value, so service capacity is maintained throughout the update.
- Auto-Scaling Events: When a cloud autoscaler decides to scale in (remove an instance), it first initiates draining through the load balancer API, ensuring no active user sessions are dropped.
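From the load balancer's side, every strategy above follows the same steps: mark the instance as draining, stop selecting it for new requests, and remove it once the drain completes. The in-memory pool below is a toy model of that behavior, not a real load balancer API; all names are hypothetical.

```python
import itertools
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DRAINING = "draining"


class Pool:
    """Round-robin pool that skips instances marked as draining."""

    def __init__(self, instances: list[str]) -> None:
        self._state = {name: State.ACTIVE for name in instances}
        self._counter = itertools.count()

    def start_draining(self, name: str) -> None:
        """Stop routing new requests to the instance; keep it registered."""
        self._state[name] = State.DRAINING

    def remove(self, name: str) -> None:
        """Deregister the instance after its drain has finished or timed out."""
        del self._state[name]

    def pick(self) -> str:
        """Choose the next active instance for a new request."""
        active = [n for n, s in self._state.items() if s is State.ACTIVE]
        if not active:
            raise RuntimeError("no active instances")
        return active[next(self._counter) % len(active)]
```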
Stateful vs. Stateless Considerations
The implementation and importance of connection draining vary significantly between stateful and stateless application architectures.
- Stateless Services: Draining is simpler. The goal is to complete HTTP requests or RPC calls. Once the last response is sent, the instance can terminate. Sticky sessions (session affinity) must be considered; the load balancer should stop assigning new sessions to a draining instance.
- Stateful Services & Persistent Connections: Draining is critical and more complex. Examples include:
- WebSocket Servers: Long-lived connections must be notified to reconnect elsewhere or be gracefully closed.
- Database Connections: Connection pools held by the instance must complete or hand off transactions.
- Streaming Data Pipelines: Consumers need to commit their offsets before shutting down.
- Agentic Systems: In a multi-agent system, an agent with in-memory context for a long-running task must persist or transfer its state before draining is complete, a concept related to agentic rollback strategies.
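For the streaming case, draining means stopping at a message boundary and recording progress so a replacement instance can resume from the same point. The class below is an in-memory stand-in rather than a real client library; `OffsetTrackingConsumer`, `handler`, and `commit_every` are hypothetical names.

```python
import threading
from typing import Callable


class OffsetTrackingConsumer:
    """In-memory stand-in for a streaming consumer with explicit commits."""

    def __init__(self, messages: list[str], handler: Callable[[str], None]) -> None:
        self._messages = messages
        self._handler = handler
        self.position = 0    # index of the next message to process
        self.committed = 0   # last position recorded as durable
        self.stop = threading.Event()

    def commit(self) -> None:
        """Record progress; a real consumer would write this to the broker."""
        self.committed = self.position

    def run(self, commit_every: int = 100) -> None:
        """Process messages until the input ends or a stop is requested."""
        while self.position < len(self._messages) and not self.stop.is_set():
            self._handler(self._messages[self.position])
            self.position += 1
            if self.position % commit_every == 0:
                self.commit()
        # Final commit so a replacement instance resumes exactly here.
        self.commit()
```

The shutdown path sets `stop`, and the loop exits only between messages, so no message is left half-processed when the final commit runs.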
Connection Draining in Major Platforms
A feature comparison of connection draining capabilities across major cloud platforms and load balancers, detailing configuration options, default behaviors, and operational specifics.
| Feature / Platform | AWS (ELB/ALB/NLB) | Google Cloud (GCLB) | Azure Load Balancer | NGINX | HAProxy |
|---|---|---|---|---|---|
| Terminology | Connection Draining (Classic ELB) / Deregistration Delay (ALB/NLB) | Connection Draining | Drain Mode | Graceful Shutdown | Graceful Stop |
| Default Draining Timeout | 300 seconds | 300 seconds | 0 seconds (immediate) | N/A (configurable) | N/A (configurable) |
| Maximum Configurable Timeout | 3600 seconds | 3600 seconds | 3600 seconds | Unlimited (via | Unlimited |
| Protocol Support | TCP, TLS, HTTP, HTTPS, UDP (NLB) | TCP, SSL, HTTP, HTTPS | TCP, UDP | All proxied protocols | All proxied protocols |
| Per-Target/Listener Configuration | | | | | |
| API/CLI Trigger | | | | | |
| Integration with Auto-Scaling | Automatic on instance termination | Automatic on instance termination | Automatic in VMSS scale-in | Manual or scripted | Manual or scripted |
| Draining State Visibility | Via DescribeTargetHealth API & Console | Via Console & gcloud CLI | Via Azure Portal & Metrics | Access logs & status page | Stats socket & admin page |
| Forces Close on Timeout | | | | | |
| Impact on Health Checks | Stopped during drain | Stopped during drain | Stopped during drain | Configurable | Configurable |
Frequently Asked Questions
Connection draining is a critical resilience pattern for gracefully managing service instance lifecycle. These questions address its core mechanisms, implementation, and role in modern, fault-tolerant architectures.
Connection draining is the process of gracefully removing a service instance (like a server, pod, or container) from a load balancer's rotation by allowing existing, in-flight connections to complete their work while refusing all new connection requests. It works by signaling the load balancer to change the instance's status. The load balancer stops sending new requests to the instance but continues to allow existing requests a configurable amount of time (the drain timeout) to finish processing. This ensures active sessions—such as file uploads, database transactions, or streaming responses—are not abruptly terminated, preventing data corruption and user-facing errors.
Related Terms
These terms define the core mechanisms and supporting patterns used to build fault-tolerant, self-healing systems that prevent cascading failures.
Circuit Breaker Pattern
A software design pattern that detects failures and prevents an application from repeatedly attempting an operation that is likely to fail. It operates in three states:
- Closed: Requests flow normally.
- Open: Requests fail immediately without calling the downstream service.
- Half-Open: A limited number of test requests are allowed to probe for recovery.
Its primary function is to stop cascading failures and allow time for a failing dependency to recover, acting as a fail-fast mechanism.
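The three states can be sketched as a small wrapper around a callable. This is a simplified, single-threaded illustration that allows exactly one probe request in the half-open state; the class name, thresholds, and injectable `clock` parameter are assumptions, not a specific library's API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_s = recovery_s
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_s:
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # recovery window elapsed: allow a probe
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed probe, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Any success resets the failure count and closes the circuit.
        self._failures = 0
        self.state = "closed"
        return result
```

A production breaker would also need locking for concurrent callers and a way to limit how many probes run at once.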
Bulkhead Pattern
A resilience pattern that isolates elements of an application into independent pools (bulkheads). If one component fails or is overwhelmed, the failure is contained, preventing a single point of failure from bringing down the entire system. In multi-agent systems, this can mean isolating different tool-calling agents or data sources into separate resource pools to ensure graceful degradation.
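One common way to build such a pool is a concurrency limit per dependency. The sketch below uses a semaphore and rejects work when the compartment is full instead of queueing it; the `Bulkhead` class and `BulkheadFullError` are hypothetical names for illustration.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls so one dependency cannot exhaust shared capacity."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn: Callable[[], T]) -> T:
        # Fail immediately instead of blocking when every slot is in use.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is full")
        try:
            return fn()
        finally:
            self._slots.release()
```

Giving each downstream dependency its own `Bulkhead` instance keeps a slow dependency from consuming the slots that calls to healthy dependencies need.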
Health Check
A periodic diagnostic request (often an HTTP endpoint or a simple function call) sent to a service or component to verify its operational status and readiness to handle traffic. Failed health checks can trigger a circuit breaker to open or cause a load balancer to stop routing traffic to an unhealthy instance. Liveness probes check if a process is running, while readiness probes determine if it can accept work.
Graceful Degradation
A system design principle where functionality is reduced in a controlled, prioritized manner when a failure occurs or resources are constrained. The system maintains core operations while non-essential features are disabled. For example, an AI agent might disable its image-generation tool if the service is down but continue to process text-based queries, providing a degraded but acceptable user experience.
Fallback
A predefined alternative response or action that a system executes when a primary operation fails. This allows the system to provide a degraded but acceptable level of service. In agentic systems, a fallback could be:
- Returning cached data.
- Using a simpler, more reliable algorithm.
- Providing a user-friendly error message with manual steps.
Fallbacks are a key strategy for implementing graceful degradation.
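A fallback chain can be expressed as an ordered list of operations, where the first one that succeeds wins. The helper below is a minimal sketch; `first_success` is a hypothetical name, and treating every exception as a reason to fall back is a simplifying assumption.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_success(operations: Sequence[Callable[[], T]]) -> T:
    """Run operations in order and return the first result that does not raise."""
    last_error: Optional[Exception] = None
    for op in operations:
        try:
            return op()
        except Exception as exc:
            last_error = exc  # remember the failure and try the next option
    if last_error is None:
        raise ValueError("no operations supplied")
    raise last_error
```

Calling `first_success([fetch_live, read_cache])`, where both callables are placeholders, would return cached data only when the live fetch raises.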
Retry Logic with Exponential Backoff
A programming technique for handling transient faults (temporary network glitches, timeouts).
- Retry Logic: Automatically re-attempts a failed operation.
- Exponential Backoff: The delay between retries increases exponentially (e.g., 1s, 2s, 4s, 8s). This reduces load on a struggling service and increases the chance it can recover.
Jitter (randomness) is often added to retry timings to prevent the thundering herd problem where many clients retry simultaneously.
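The technique can be sketched as a retry loop whose delay ceiling doubles on each attempt. The version below uses "full jitter" (a uniform random delay between zero and the ceiling), which is one of several jitter variants; the function name, defaults, and injectable `sleep` and `rng` parameters are assumptions for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retries: int = 4,
    base_s: float = 1.0,
    cap_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call fn, retrying failures with capped exponential backoff and jitter."""
    rng = rng or random.Random()
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # The ceiling doubles each attempt (1s, 2s, 4s, ...) up to the cap.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))
```

In real use the `except` clause should be narrowed to errors known to be transient, since retrying a permanent failure only adds load.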

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.