The Out-of-Memory (OOM) Killer is a Linux kernel mechanism that autonomously selects and terminates one or more processes to free up RAM when the system faces critical memory exhaustion and cannot satisfy an allocation. It functions as a last-resort fault-tolerance mechanism, preventing a complete system lockup or crash by sacrificing specific processes to preserve overall stability. The kernel's page allocator invokes the killer after other memory management techniques, like reclaiming page cache and swapping, have failed to free sufficient memory; a separate kernel thread, the OOM reaper, then reclaims the victim's memory.
Out-of-Memory (OOM) Killer

What is Out-of-Memory (OOM) Killer?
A core Linux kernel mechanism for autonomous system recovery during critical memory exhaustion.
The selection algorithm, governed by an oom_score assigned to each process, aims to maximize freed memory while minimizing disruption, typically targeting the largest consumer of RAM. This exemplifies a self-healing architectural pattern in which the system executes a corrective action without human intervention, although the kernel makes no attempt to diagnose why memory ran out. In modern orchestrated environments like Kubernetes, understanding the OOM Killer is critical for configuring resource limits, alongside availability controls such as pod disruption budgets, to ensure graceful degradation and predictable recovery within a fault-tolerant design.
Key Characteristics of the OOM Killer
The Out-of-Memory (OOM) Killer is a last-resort mechanism in the Linux kernel that terminates processes to prevent a complete system crash when available memory is critically exhausted.
Triggering Mechanism
The OOM Killer activates when the kernel cannot satisfy a memory request and cannot free sufficient memory through its normal page cache reclaim and swap mechanisms. This state is known as an OOM condition. How readily a system reaches it depends on the overcommit policy (vm.overcommit_memory), which controls whether the kernel grants allocations that exceed available memory.
- Overcommit Modes: The system can be set to heuristic overcommit (mode 0, default), always overcommit (mode 1), or never overcommit (mode 2).
- Invocation: The kernel invokes the out_of_memory() function, which selects a victim process using the oom_badness() scoring algorithm.
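As a rough illustration of checking which overcommit policy a host is running, the sketch below maps the raw sysctl value to a readable name, using the mode numbering from the kernel's sysctl documentation (0 heuristic, 1 always, 2 never). The function names are hypothetical, and the default path assumes a Linux host with /proc mounted.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory values (0, 1, 2).
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "never overcommit (strict accounting)",
}


def describe_overcommit(raw: str) -> str:
    """Translate the raw sysctl value into a readable policy name."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path: str = "/proc/sys/vm/overcommit_memory") -> str:
    """Read the live setting; only meaningful on a Linux host."""
    return describe_overcommit(Path(path).read_text())


# The parser works on any text, so it can be tried without touching /proc.
print(describe_overcommit("2\n"))
```

On a Linux machine, calling current_overcommit() reads the live value instead of the sample string.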
Victim Selection Algorithm (oom_badness)
The kernel computes a badness score for each candidate process to determine the best candidate for termination, and exposes it to userspace as /proc/[pid]/oom_score. The score is calculated primarily from the process's memory footprint, adjusted by an oom_score_adj value set by userspace (range -1000 to 1000).
- Base Calculation: Starts with the process's resident set size (RSS) plus its swap usage and page table memory.
- Adjustment: The oom_score_adj value is scaled against the total memory available to the allocation and added to the base. Positive values make a process a more likely victim; negative values protect it, and -1000 exempts it entirely.
- Exclusions: Kernel threads and the init process are never selected.
- Legacy Heuristics: Older kernels also weighted factors such as short runtime, non-root ownership, and the absence of direct hardware I/O. The current heuristic no longer uses them.
- Inheritance: Child processes inherit the parent's oom_score_adj.
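The scoring idea can be sketched in a few lines. This is a simplified model of the arithmetic, not the kernel's code: it assumes the score is RSS plus swap plus page table pages, shifted by oom_score_adj in units of one thousandth of total memory. The process names and page counts are made up for illustration.

```python
from typing import Optional


def badness(rss_pages: int, swap_pages: int, pgtable_pages: int,
            oom_score_adj: int, total_pages: int) -> Optional[int]:
    """Simplified badness score in pages; None means 'never kill'."""
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    if oom_score_adj == -1000:
        return None  # the process is exempt from selection
    base = rss_pages + swap_pages + pgtable_pages
    # Each adj unit is worth 1/1000 of the memory the allocation can use.
    return base + oom_score_adj * (total_pages // 1000)


def pick_victim(procs: dict, total_pages: int) -> Optional[str]:
    """Return the name with the highest score, skipping exempt processes."""
    scores = {name: badness(*fields, total_pages)
              for name, fields in procs.items()}
    eligible = {name: s for name, s in scores.items() if s is not None}
    return max(eligible, key=eligible.get) if eligible else None


# Hypothetical processes: (rss, swap, pgtable, oom_score_adj), in pages.
procs = {
    "database": (400_000, 0, 2_000, -500),
    "batch-job": (150_000, 10_000, 500, 300),
    "sshd": (2_000, 0, 50, -1000),
}
print(pick_victim(procs, total_pages=1_000_000))  # prints "batch-job"
```

In this example the database uses the most memory, but its negative adjustment pushes its score below the batch job's, so the batch job is chosen.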
Configurability and Control
System administrators and container orchestrators can exert significant control over the OOM Killer's behavior through several interfaces.
- Process-Level Tuning: The oom_score_adj value in /proc/[pid]/oom_score_adj allows prioritizing or protecting specific processes. A value of -1000 exempts a process from OOM killing.
- Cgroup Control: In containerized environments, memory cgroups are the primary control mechanism. Under cgroup v1, the memory.oom_control file allows disabling the OOM Killer for the cgroup (oom_kill_disable 1) or monitoring OOM events. Under cgroup v2, OOM events are reported in the memory.events file.
- Sysctl Parameters: Global kernel parameters like vm.panic_on_oom can be set to trigger a kernel panic instead of process killing.
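Process-level tuning amounts to writing an integer into a procfs file. The sketch below shows one way to do that with range validation; the helper names and the proc_root parameter are hypothetical, added so the demo can run against a throwaway directory instead of the real /proc. On a real system, lowering a value usually requires elevated privileges.

```python
import tempfile
from pathlib import Path


def adj_path(pid: int, proc_root: str = "/proc") -> Path:
    """Location of the adjustment file for one process."""
    return Path(proc_root) / str(pid) / "oom_score_adj"


def get_oom_score_adj(pid: int, proc_root: str = "/proc") -> int:
    return int(adj_path(pid, proc_root).read_text().strip())


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Write a new adjustment after checking the documented range."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    adj_path(pid, proc_root).write_text(f"{value}\n")


# Demo against a temporary directory standing in for /proc (PID is made up).
with tempfile.TemporaryDirectory() as fake_proc:
    (Path(fake_proc) / "4242").mkdir()
    set_oom_score_adj(4242, -500, proc_root=fake_proc)
    demo_value = get_oom_score_adj(4242, proc_root=fake_proc)
print(demo_value)  # prints -500
```

Keeping the range check in userspace gives a clear error message before the write is attempted.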
Interaction with Cgroups and Containers
In modern infrastructure, the OOM Killer interacts with Linux control groups (cgroups), which are fundamental to containers (Docker, Kubernetes). Each memory cgroup can carry its own limit, creating nested OOM domains.
- Hierarchical Enforcement: When a cgroup exceeds its memory limit, the OOM Killer is invoked within that cgroup's subtree. It only considers processes belonging to the offending cgroup and its children.
- Kubernetes Pods: Kubernetes places each pod and each of its containers in a cgroup. The spec.containers[].resources.limits.memory field sets the limit for an individual container. If a container's memory usage exceeds its limit and cannot be reclaimed, an OOM kill occurs inside that container's cgroup.
- OOM Kill as an Event: Orchestrators like Kubernetes report an OOM kill as a container termination with Reason: OOMKilled, which is then handled according to the pod's restart policy.
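To show how the OOMKilled reason surfaces in practice, here is a minimal sketch that scans a pod object (as returned in JSON form by the Kubernetes API) for containers whose current or previous termination was an OOM kill. The function name and the sample pod are illustrative assumptions; the field names follow the pod status structure.

```python
import json


def oom_killed_containers(pod: dict) -> list:
    """Names of containers whose current or last termination was OOMKilled."""
    names = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for cs in statuses:
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break  # count each container once
    return names


# Hypothetical pod status, trimmed to the fields used above.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {"name": "api", "restartCount": 3,
       "state": {"running": {}},
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "sidecar", "restartCount": 0,
       "state": {"running": {}}, "lastState": {}}
    ]
  }
}
""")
print(oom_killed_containers(sample_pod))  # prints ['api']
```

Checking lastState as well as state matters because a restarted container is already running again by the time most tools look at it.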
System Impact and Observability
An OOM Kill is a disruptive event with clear signals in system logs and metrics, crucial for debugging and building resilient systems.
- Log Traces: The kernel logs detailed messages, readable via dmesg or in /var/log/kern.log on distributions that write it, including the killed process's PID, its score or oom_score_adj, and its memory usage at the time of the kill.
- Metrics: Monitoring systems should track OOM kill events (e.g., the oom_kill counter in /proc/vmstat).
- Performance Implications: The kill itself is fast, but the preceding memory exhaustion causes severe thrashing, where the system spends most of its time swapping pages in and out and becomes unresponsive.
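A monitoring check for the oom_kill counter can be as simple as comparing two snapshots of /proc/vmstat. The sketch below parses the "name value" line format and reports the difference; the function names and the sample snapshots are hypothetical, and a real collector would read the file at each interval.

```python
def parse_counters(text: str) -> dict:
    """Parse 'name value' lines, as used by /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def new_oom_kills(before: str, after: str) -> int:
    """Number of OOM kills that happened between two snapshots."""
    old = parse_counters(before).get("oom_kill", 0)
    new = parse_counters(after).get("oom_kill", 0)
    return new - old


# Made-up excerpts of /proc/vmstat taken a few minutes apart.
snapshot_a = "nr_free_pages 81234\npgfault 9912345\noom_kill 2\n"
snapshot_b = "nr_free_pages 1203\npgfault 9988001\noom_kill 5\n"
print(new_oom_kills(snapshot_a, snapshot_b))  # prints 3
```

Alerting on the delta rather than the absolute value avoids firing on kills that happened long before the monitor started.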
Design Philosophy and Trade-offs
The OOM Killer reflects a pragmatic trade-off in the spirit of fail-fast and let-it-crash designs: sacrifice an individual process to preserve overall system stability.
- Last Resort: It is not a memory management feature but a failure mitigation mechanism. Well-designed systems should use memory limits and monitoring to avoid triggering it.
- Non-Determinism: While the scoring algorithm is deterministic, the final victim can be unpredictable in complex, rapidly changing memory states, making it unsuitable for guaranteeing specific process survival.
- Shared Physical Memory: On Linux, user processes and the kernel draw from the same pool of physical memory, and the default overcommit policy lets processes reserve more than actually exists. A runaway process can therefore starve the rest of the system, including the kernel's own allocations, hence the need for the OOM Killer.
OOM Killer vs. Related Fault-Tolerance Mechanisms
This table compares the Linux Out-of-Memory Killer to other core fault-tolerance mechanisms, highlighting their distinct triggers, actions, and operational philosophies within resilient system design.
| Feature | Out-of-Memory (OOM) Killer | Circuit Breaker Pattern | Bulkhead Pattern | Let-It-Crash Philosophy |
|---|---|---|---|---|
| Primary Trigger | System-wide memory exhaustion (RAM + swap) | Repeated failures of a downstream service call | Resource exhaustion in one subsystem (e.g., thread pool) | Any internal process error or exception |
| Core Action | Selectively terminates a process to free memory | Trips open to fail-fast, blocking calls to the failing service | Isolates failure to a partitioned resource pool | Process is allowed to crash and is restarted by a supervisor |
| Scope of Impact | System-wide (any user process) | Service-to-service communication path | Within-application resource groups | Individual, supervised actor/process |
| Goal | Prevent kernel panic and keep host system running | Prevent cascading failures and allow recovery time | Limit blast radius of a failure | Achieve fault tolerance through isolation and restart |
| Proactive/Reactive | Reactive (last-resort response) | Reactive (based on failure count/timeout) | Proactive (architectural isolation) | Proactive (architectural supervision) |
| Key Metric | oom_score (badness heuristic) | Failure rate, timeout count | Resource pool utilization (e.g., threads, connections) | Process lifespan, restart count |
| Automation Level | Fully automated by kernel | Implemented in client/service mesh | Architecturally enforced at design time | Orchestrated by supervisor hierarchy |
| Recovery Mechanism | Memory freed by kill; process must be restarted externally | Automatic half-open state to test recovery; then closes | Healthy partitions remain operational; failed pool recovers | Supervisor automatically restarts process from clean state |
Frequently Asked Questions
The Out-of-Memory Killer is a critical Linux kernel mechanism that acts as a last resort to prevent total system failure by terminating processes when available memory is exhausted. It is a foundational component of resilient, self-healing software systems.
The Out-of-Memory (OOM) Killer is a mechanism within the Linux kernel that is invoked when the system is critically low on available RAM and cannot free sufficient memory through normal means, forcing it to select and terminate one or more processes to prevent a complete system crash.
When a Linux system faces an out-of-memory condition, the kernel first attempts to reclaim memory through its regular mechanisms, such as swapping to disk (if swap is configured) and dropping clean page caches. If these efforts fail to free enough memory, the OOM Killer is triggered. Its primary function is to sacrifice specific processes to keep the core kernel operational, thereby acting as a circuit breaker for system memory. This mechanism is a classic example of a fault-tolerant design, prioritizing overall system stability over the survival of any single application.
Related Terms
The Out-of-Memory Killer is a critical component of a resilient system, but it operates within a broader ecosystem of fault tolerance and resource management patterns. These related concepts define the architectural context for building self-healing software.
Bulkhead Pattern
A fault isolation design that partitions system resources—such as thread pools, connections, or memory allocations—into isolated groups (bulkheads). This prevents a failure or resource exhaustion in one component from cascading and bringing down the entire system. For example, a web server might use separate connection pools for user-facing APIs and internal reporting services.
- Key Mechanism: Resource partitioning and quota enforcement.
- Relation to OOM Killer: Both are defensive mechanisms. The Bulkhead pattern proactively isolates faults to prevent system-wide resource exhaustion that could trigger the OOM Killer. The OOM Killer reactively terminates a process after exhaustion occurs.
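As a small illustration of resource partitioning, the sketch below caps how many calls may run inside one partition at a time and rejects the rest immediately. The Bulkhead class and the workload names are hypothetical; real implementations often size thread or connection pools instead of using a bare semaphore.

```python
import threading


class Bulkhead:
    """Caps concurrent calls in one partition; extra callers are rejected."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: a full partition fails fast instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


reporting = Bulkhead(max_concurrent=1)  # low-priority partition
api = Bulkhead(max_concurrent=2)        # user-facing partition


def slow_report():
    # While this report holds the only reporting slot, a second report is
    # rejected, but the API partition is unaffected.
    try:
        reporting.run(lambda: "second report")
        second = "accepted"
    except RuntimeError:
        second = "rejected"
    return second, api.run(lambda: "api ok")


outcome = reporting.run(slow_report)
print(outcome)  # prints ('rejected', 'api ok')
```

The nested call stands in for a concurrent request so the example stays deterministic without threads.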
Circuit Breaker Pattern
A stability design pattern that prevents an application from repeatedly attempting to execute an operation that is likely to fail. It wraps calls to a service and monitors for failures; if failures exceed a threshold, the circuit "opens" and fails fast for a period, allowing the downstream service time to recover.
- Key Mechanism: Fail-fast behavior and automatic recovery probing.
- Relation to OOM Killer: Both protect system stability. A Circuit Breaker stops cascading failures in software logic and network calls, which can prevent scenarios that lead to abnormal memory consumption and subsequent OOM events.
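The open, half-open, and closed states can be sketched as a small wrapper class. This is a minimal illustration under simplifying assumptions (consecutive-failure counting, a single trial call after the cool-down, no thread safety); the class and parameter names are hypothetical, and the injected clock exists only to make the demo deterministic.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise ConnectionError("downstream unavailable")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

try:
    breaker.call(flaky)
    state = "closed"
except RuntimeError:
    state = "open"
except ConnectionError:
    state = "closed"

now[0] = 11.0  # cool-down has passed
recovered = breaker.call(lambda: "ok")
print(state, recovered)  # prints "open ok"
```

While the circuit is open the wrapped function is never invoked, which is what spares the struggling downstream service.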
Graceful Degradation
A design philosophy where a system maintains limited, core functionality in the face of partial failures or resource constraints, ensuring a basic level of service rather than a complete outage. This often involves feature toggles, fallback mechanisms, and simplified operational modes.
- Key Mechanism: Prioritization of critical functions and controlled reduction of non-essential features.
- Relation to OOM Killer: Graceful degradation is a proactive, application-level strategy to reduce resource demand before a crisis. The OOM Killer is a reactive, system-level last resort when degradation or other measures have failed to prevent critical memory exhaustion.
Health Probe
A diagnostic check used by an orchestrator (like Kubernetes) to determine the operational status of a service or container. Liveness probes check if the process is running; readiness probes check if it's ready to serve traffic. Failed probes trigger automatic restarts or traffic redirection.
- Key Mechanism: Periodic endpoint checks (HTTP, TCP, command execution).
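- Relation to OOM Killer: Health probes complement the orchestrator's own exit detection. If a container's main process is killed by the OOM Killer, the kubelet sees the container exit and restarts it according to the restart policy, without waiting for a probe. If the kernel instead kills a child process and the container keeps running in a degraded state, a failing liveness probe is what triggers the restart, which is a key self-healing action.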
Backpressure
A flow control mechanism in data streaming systems where a fast data producer is signaled to slow down or pause to match the processing speed of a slower consumer. This prevents the consumer from being overwhelmed, which can lead to queue overflows, increased latency, and memory exhaustion.
- Key Mechanism: Feedback loops and adaptive rate limiting (e.g., TCP windows, Reactive Streams spec).
- Relation to OOM Killer: Effective backpressure management is a crucial application-level defense against memory exhaustion. Uncontrolled data inflow is a common cause of memory bloat. The OOM Killer acts when backpressure mechanisms are absent or insufficient.
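One common way to get backpressure inside a single process is a bounded queue: the producer blocks when the buffer is full, so memory held by in-flight items stays capped. The sketch below is a minimal illustration; run_pipeline and its doubling "work" are hypothetical stand-ins for a real consumer.

```python
import queue
import threading


def run_pipeline(items, capacity=4):
    """Feed items to a consumer through a bounded buffer; return results and peak depth."""
    buf = queue.Queue(maxsize=capacity)
    results = []
    peak = 0

    def consumer():
        while True:
            item = buf.get()
            if item is None:  # sentinel: producer is done
                break
            results.append(item * 2)  # stand-in for real processing

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        buf.put(item)  # blocks while the buffer is full: this is the backpressure
        peak = max(peak, buf.qsize())
    buf.put(None)
    worker.join()
    return results, peak


results, peak = run_pipeline(range(100), capacity=4)
print(len(results), peak <= 4)  # prints "100 True"
```

With an unbounded queue the same producer could enqueue all items at once, which is the kind of memory growth that eventually ends in an OOM kill.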
Pod Disruption Budget (PDB)
A Kubernetes API object that limits the number of pods of a replicated application that can be down simultaneously from voluntary disruptions (like node drains). It ensures a minimum number of available pods or a maximum number of unavailable pods during such operations.
- Key Mechanism: Constraint enforcement through the Kubernetes Eviction API.
- Relation to OOM Killer: While a PDB does not govern involuntary disruptions like an OOM kill, it is part of the same high-availability landscape. Understanding PDBs is essential for designing deployments that can withstand the pod termination caused by an OOM event without violating service availability guarantees.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.