Amodal segmentation is the task of predicting the complete, holistic shape of an object, including its occluded or unseen portions, based solely on its visible parts in an image. It moves beyond standard instance segmentation, which only delineates visible pixels, to infer an object's full spatial extent. This requires models to perform occlusion reasoning, leveraging learned priors about object geometry and scene structure to reconstruct plausible whole shapes from partial observations.
What is Amodal Segmentation?
Amodal segmentation is a computer vision task focused on predicting the complete shape of objects, including parts that are occluded or outside the image frame.
This capability is foundational for embodied intelligence systems and advanced 3D scene understanding, where agents must reason about object permanence and potential interactions with hidden surfaces. It is closely related to visual grounding and panoptic segmentation, but is distinguished by its explicit focus on inferring the unseen. Successful amodal segmentation enables more robust robotic manipulation, improved AR/VR overlays, and richer scene graph generation for complex visual reasoning tasks.
Key Characteristics of Amodal Segmentation
Amodal segmentation is a sophisticated computer vision task that predicts the complete shape of an object, including its occluded or unseen portions, based on visible cues. It requires reasoning beyond direct pixel evidence.
Reasoning Over Occlusion
The core challenge is inferring the complete shape of an object when parts are hidden. This requires the model to leverage:
- Visible cues: The shape, texture, and edges of the visible portion.
- Contextual priors: Learned knowledge about typical object shapes (e.g., a car has four wheels, even if only two are visible).
- Scene geometry: Understanding depth ordering and occlusion relationships between objects.
Unlike standard segmentation, it outputs a full mask that extends into occluded regions.
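The relationship between the visible mask and the full mask can be made concrete with a small sketch. The function name `decompose_amodal` and the toy arrays below are illustrative assumptions, not part of any benchmark or library; the only idea taken from the text is that the amodal mask is the visible region plus the occluded region.

```python
import numpy as np

def decompose_amodal(amodal: np.ndarray, visible: np.ndarray):
    """Split a full (amodal) mask into its visible and occluded parts.

    Hypothetical helper for illustration: both inputs are binary masks
    of the same shape.
    """
    amodal = amodal.astype(bool)
    # Visible pixels are assumed to lie inside the full mask.
    visible = visible.astype(bool) & amodal
    occluded = amodal & ~visible
    # Fraction of the object that is hidden (0.0 = fully visible).
    occlusion_rate = float(occluded.sum() / max(amodal.sum(), 1))
    return visible, occluded, occlusion_rate

# Toy 1x6 scene: the object spans columns 1-4, an occluder hides columns 3-4.
amodal = np.array([[0, 1, 1, 1, 1, 0]])
visible = np.array([[0, 1, 1, 0, 0, 0]])
vis, occ, rate = decompose_amodal(amodal, visible)
print(occ.astype(int), rate)
```

A standard instance segmenter would only be asked for `vis`; an amodal model is asked for the union of `vis` and `occ`.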
Inherent Ambiguity
Amodal segmentation is an ill-posed problem—multiple plausible full shapes can explain the same visible portion. For example, the occluded part of a person could be in various poses. Models address this by:
- Learning probabilistic shape priors from training data.
- Often predicting a single most-likely complete mask.
- Some advanced methods may generate multiple hypotheses to capture uncertainty.
Ground truth is also challenging, often requiring human annotators to imagine and draw hidden contours.
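One simple way to work with multiple hypotheses is to aggregate them per pixel. The sketch below assumes a model has already produced K binary completions for one object; the function name, the 0.5 threshold, and the use of binary entropy as an uncertainty score are illustrative choices rather than a method described here.

```python
import numpy as np

def summarize_hypotheses(hypotheses: np.ndarray, threshold: float = 0.5):
    """Aggregate K candidate amodal masks of shape (K, H, W).

    Returns the per-pixel agreement, a consensus mask, and the
    per-pixel binary entropy in bits (high where hypotheses disagree).
    """
    prob = hypotheses.astype(float).mean(axis=0)
    consensus = prob >= threshold
    p = np.clip(prob, 1e-9, 1 - 1e-9)  # avoid log(0)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return prob, consensus, entropy

# Four toy hypotheses for a 1x4 region: all agree on the first two
# pixels and disagree on how far the object extends to the right.
hypotheses = np.array([
    [[1, 1, 0, 0]],
    [[1, 1, 1, 0]],
    [[1, 1, 1, 1]],
    [[1, 1, 0, 0]],
])
prob, consensus, entropy = summarize_hypotheses(hypotheses)
print(prob, consensus, entropy.round(3))
```

Pixels with high entropy are the ones where the visible evidence does not pin down the hidden shape.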
Dependence on Instance-Level Understanding
To reason about occlusion, a model must first identify and separate individual object instances. Therefore, amodal segmentation typically builds upon or integrates with:
- Instance segmentation: To obtain initial visible masks and object identities.
- Object detection: For bounding boxes that provide spatial context.
- Depth estimation: To understand which objects are in front and which are behind.
It is fundamentally an instance-aware task, as occlusion reasoning is per-object.
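When both visible and amodal masks are available per instance, a pairwise occlusion order can be read off directly, without a separate depth estimate. The rule sketched below (A is in front of B if A's visible pixels cover part of B's hidden region) is a common heuristic; the function names and the toy scene are assumptions made for illustration.

```python
import numpy as np

def occludes(vis_a: np.ndarray, amodal_b: np.ndarray, vis_b: np.ndarray) -> bool:
    """True if A's visible pixels cover part of B's hidden region."""
    hidden_b = amodal_b.astype(bool) & ~vis_b.astype(bool)
    return bool((vis_a.astype(bool) & hidden_b).any())

def occlusion_order(visible: list, amodal: list) -> list:
    """Return (front, back) index pairs for every occluding pair."""
    n = len(visible)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and occludes(visible[i], amodal[j], visible[j])
    ]

# Toy 1x6 scene: object 0 is a fully visible box in columns 2-3;
# object 1 spans columns 1-4 but is only visible in columns 1 and 4.
visible = [np.array([[0, 0, 1, 1, 0, 0]]), np.array([[0, 1, 0, 0, 1, 0]])]
amodal = [np.array([[0, 0, 1, 1, 0, 0]]), np.array([[0, 1, 1, 1, 1, 0]])]
order = occlusion_order(visible, amodal)
print(order)
```

The check is per pair of instances, which is why the task needs instance identities before any occlusion reasoning can start.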
Connection to Physical Reasoning
Predicting occluded shapes is a step towards a physical understanding of scenes. It enables systems to infer:
- Object stability: Predicting full shape helps reason about contact points and support (e.g., a chair leg behind a table).
- Manipulation affordances: A robot can plan grasps on the inferred full handle of an occluded mug.
- Scene completion: Critical for autonomous systems (robots, AR/VR) that must interact with a partially observable world.
This bridges low-level vision with high-level, actionable 3D scene understanding.
Evaluation Metrics
Standard segmentation metrics are adapted, with a focus on the predicted occluded regions. Common metrics include:
- Amodal Intersection over Union (AIoU): IoU calculated using the predicted full mask and the ground truth full mask.
- Visible IoU (VIoU): IoU for just the visible parts, ensuring the visible prediction isn't degraded.
- Occluded IoU (OIoU): IoU specifically for the occluded portion of the object, which is the hardest part.
- Precision/Recall for the occluded region.
Benchmarks like KINS and COCOA provide datasets with amodal annotations.
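The three IoU variants can be sketched as follows. Exact definitions differ between papers; this version assumes the model outputs both a visible and an amodal mask, and takes the occluded region to be the amodal mask minus the visible mask on each side. The function and helper names are hypothetical.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two boolean masks (1.0 if both are empty)."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def amodal_metrics(pred_amodal, pred_visible, gt_amodal, gt_visible) -> dict:
    """AIoU, VIoU and OIoU for one object, under the assumptions above."""
    pa, pv, ga, gv = (
        np.asarray(m, dtype=bool)
        for m in (pred_amodal, pred_visible, gt_amodal, gt_visible)
    )
    return {
        "AIoU": iou(pa, ga),             # full mask vs. full mask
        "VIoU": iou(pv, gv),             # visible part only
        "OIoU": iou(pa & ~pv, ga & ~gv), # occluded part only
    }

def cols(start: int, stop: int, width: int = 8) -> np.ndarray:
    """Toy helper: 1 x width mask with columns start..stop-1 set."""
    mask = np.zeros((1, width), dtype=bool)
    mask[0, start:stop] = True
    return mask

# Ground truth: visible columns 1-3, hidden columns 4-6.
# Prediction: visible part is exact, completion stops one column short.
metrics = amodal_metrics(
    pred_amodal=cols(1, 6), pred_visible=cols(1, 4),
    gt_amodal=cols(1, 7), gt_visible=cols(1, 4),
)
print(metrics)
```

In this toy case the amodal IoU stays fairly high while the occluded IoU drops more, which is why the occluded region is usually reported separately.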
Common Architectural Approaches
Models often extend existing instance segmentation frameworks (like Mask R-CNN) with specialized modules:
- Shape Prior Networks: Learn a latent space of object shapes from data (e.g., using variational autoencoders).
- Recurrent Refinement: Iteratively expand the visible mask into occluded areas.
- Multi-Task Learning: Jointly predict visible masks, occlusion boundaries, and depth order to inform the amodal prediction.
- Transformer-Based: Use attention mechanisms to aggregate contextual information from across the image and between instances to reason about occlusion.
Many are trained with a combination of visible mask loss and amodal mask loss.
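A combined objective of this kind might look like the sketch below, written in NumPy for readability rather than in a training framework. Binary cross-entropy is one common choice of mask loss; the weights `w_visible` and `w_amodal` are hypothetical hyperparameters, not values from any specific model.

```python
import numpy as np

def bce(prob: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(prob, eps, 1 - eps)  # avoid log(0)
    return float(-(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean())

def combined_mask_loss(vis_prob, amodal_prob, vis_gt, amodal_gt,
                       w_visible: float = 1.0, w_amodal: float = 1.0) -> float:
    """Weighted sum of the visible-mask loss and the amodal-mask loss."""
    return w_visible * bce(vis_prob, vis_gt) + w_amodal * bce(amodal_prob, amodal_gt)

vis_gt = np.array([[1.0, 1.0, 0.0, 0.0]])
amodal_gt = np.array([[1.0, 1.0, 1.0, 0.0]])

# An uninformative prediction (0.5 everywhere) versus a perfect one.
uniform = np.full_like(vis_gt, 0.5)
loss_uniform = combined_mask_loss(uniform, uniform, vis_gt, amodal_gt)
loss_perfect = combined_mask_loss(vis_gt, amodal_gt, vis_gt, amodal_gt)
print(loss_uniform, loss_perfect)
```

Keeping the visible term in the objective is what discourages the model from trading visible-region accuracy for better completions.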
How Amodal Segmentation Works
Amodal segmentation is the computer vision task of predicting the complete shape of an object, including its occluded or unseen parts, based solely on its visible portions in an image. Unlike instance segmentation, which only delineates visible pixels, amodal segmentation infers the full object extent. This requires models to perform occlusion reasoning, using learned priors about object geometry and scene context to hallucinate plausible completions for hidden regions.
The process typically involves a two-stage architecture: first, a standard segmentation model identifies visible object masks; second, a specialized network, often using transformer or generative components, predicts the amodal mask. This is critical for embodied AI and robotics, where understanding an object's full physical properties is necessary for safe manipulation and navigation. It is a foundational capability for visual grounding and reasoning, enabling systems to build a more complete mental model of a scene.
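The two-stage flow can be sketched with toy stand-ins for both stages. Nothing here is a real model: `stage1_visible_masks` simply reads instances from a label map, and `stage2_complete` uses a deliberately crude prior (fill the object's bounding box wherever another instance is in the way) in place of a learned completion network. All names are assumptions for illustration.

```python
import numpy as np

def stage1_visible_masks(label_map: np.ndarray) -> dict:
    """Stand-in for an instance segmenter: one boolean mask per non-zero id."""
    ids = [i for i in np.unique(label_map) if i != 0]
    return {int(i): label_map == i for i in ids}

def stage2_complete(visible: np.ndarray, others: np.ndarray) -> np.ndarray:
    """Toy completion: extend the visible mask to its bounding box,
    but only into pixels covered by other instances (possible occluders)."""
    rows, cols = np.nonzero(visible)
    box = np.zeros_like(visible)
    box[rows.min():rows.max() + 1, cols.min():cols.max() + 1] = True
    return visible | (box & others)

def amodal_pipeline(label_map: np.ndarray) -> dict:
    """Run stage 1, then complete each instance against all the others."""
    visible = stage1_visible_masks(label_map)
    result = {}
    for i, mask in visible.items():
        others = np.zeros_like(mask)
        for j, other in visible.items():
            if j != i:
                others |= other
        result[i] = stage2_complete(mask, others)
    return result

# Instance 1 is a horizontal bar cut in two by instance 2.
label_map = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 2, 2, 1, 1],
    [0, 0, 2, 2, 0, 0],
])
amodal = amodal_pipeline(label_map)
print(amodal[1].astype(int))
```

In a real system the second stage would be a learned network conditioned on image features; the structure of the pipeline, visible masks in and completed masks out, is the part this sketch is meant to show.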
Applications and Use Cases
Amodal segmentation is a critical capability for systems that must reason about the physical world. By inferring complete object shapes, it enables more robust perception and planning in complex, cluttered environments.
Robotics and Embodied AI
For robots manipulating objects or navigating, understanding an object's full extent is essential for safe and effective interaction. Amodal segmentation provides the complete shape, enabling precise grasp planning for occluded handles or edges, accurate path planning around fully understood obstacles, and better state estimation of objects during manipulation (e.g., knowing the full shape of a cup being lifted from a cluttered shelf). This is foundational for visuomotor control policies and task and motion planning.
Autonomous Vehicles
In driving scenarios, vehicles, pedestrians, and obstacles are frequently partially hidden. Amodal segmentation allows a perception system to infer the true size and position of occluded objects, leading to:
- Safer trajectory prediction: Understanding a pedestrian's full body pose even when behind a parked car.
- Improved occupancy mapping: Creating a more complete map of drivable space and static/dynamic obstacles.
- Enhanced risk assessment: Anticipating potential collisions with objects whose visible portion suggests a dangerous full extent (e.g., the rear of a truck extending into the lane).
Augmented and Virtual Reality
For seamless blending of digital content with the real world, AR/VR systems must understand scene geometry at a deep level. Amodal segmentation enables:
- Realistic object occlusion: Digital objects can correctly pass behind and in front of real-world objects based on their inferred complete shapes.
- Persistent content anchoring: Virtual objects can be placed in stable locations, understanding that a real table continues under a book.
- Physics-based interaction: Simulating plausible physical interactions between virtual and real objects requires knowledge of the real objects' complete volumes.
Medical Image Analysis
In medical scans, anatomical structures often overlap or are partially obscured. Amodal reasoning helps in:
- Complete organ segmentation: Inferring the full boundary of an organ partially obscured by another (e.g., the liver behind the ribs in a CT scan).
- Tumor volume estimation: More accurately measuring the total volume of a lesion that may be partially hidden by tissue or bone.
- Surgical planning: Providing surgeons with a complete 3D model of critical structures, including portions not directly visible in a single imaging plane, for pre-operative planning.
Scene Understanding and Reconstruction
Amodal segmentation is a key component of advanced 3D scene understanding. By predicting complete object masks, systems can better reason about:
- Scene composition: Understanding the spatial layout and how objects are arranged in depth.
- Occlusion reasoning: Explicitly modeling what is occluding what, which is crucial for generating coherent scene graphs and for visual commonsense reasoning tasks.
- 3D reconstruction: Amodal 2D masks provide stronger constraints for estimating an object's complete 3D shape from single or multiple views, aiding in creating detailed digital twins or neural radiance fields.
Video Analysis and Tracking
In video, objects constantly enter, exit, and occlude one another. Amodal segmentation enhances temporal consistency and tracking:
- Robust multi-object tracking (MOT): Maintaining object identity through heavy occlusion by relying on the predicted amodal shape and position.
- Inpainting and prediction: Generating plausible video frames during occlusion events by understanding what should be behind an occluder.
- Action recognition: Improving recognition of human actions by understanding the full body pose even when parts are hidden by objects or other people.
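To illustrate why amodal extent helps tracking, the sketch below compares frame-to-frame box overlap for a person who becomes half occluded. The boxes, the 0.5 matching threshold, and the use of plain IoU association are assumptions chosen for the example; real trackers typically combine overlap with motion and appearance cues.

```python
def box_iou(a: tuple, b: tuple) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

MATCH_THRESHOLD = 0.5  # hypothetical association threshold

# Frame t: the person is fully visible.
prev_box = (10, 0, 30, 40)
# Frame t+1: the person moved slightly and the lower body is now hidden.
next_visible_box = (12, 0, 32, 15)  # box around visible pixels only
next_amodal_box = (12, 0, 32, 40)   # box around the predicted full extent

iou_visible = box_iou(prev_box, next_visible_box)
iou_amodal = box_iou(prev_box, next_amodal_box)
print(round(iou_visible, 3), round(iou_amodal, 3))
```

With the visible box the overlap falls below the threshold and the track would be dropped; with the amodal box the same person is still matched.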
Amodal vs. Other Segmentation Tasks
A detailed comparison of amodal segmentation with other core computer vision segmentation tasks, highlighting key differences in objective, output, and handling of occlusion.
| Feature / Metric | Amodal Segmentation | Instance Segmentation | Semantic Segmentation | Panoptic Segmentation |
|---|---|---|---|---|
| Primary Objective | Predict the complete shape of objects, including occluded parts. | Detect and delineate each visible instance of an object. | Classify every pixel into a semantic category, without separating instances. | Unify instance (things) and semantic (stuff) segmentation into a single, non-overlapping output. |
| Handling of Occlusion | Explicitly reasons about and predicts occluded regions. | Segments only visible pixels; occluded parts are ignored. | Not applicable; classifies all pixels regardless of instance or occlusion. | For things (countable objects), segments only visible regions like instance segmentation. |
| Output for an Object | A single mask representing the object's full extent. | A unique mask for each object instance (visible parts only). | A per-pixel class label (e.g., 'person', 'car'). No instance identity. | For things: a unique instance ID mask (visible). For stuff: a semantic class label. |
| Requires Instance Identity | Yes | Yes | No | Yes (for things) |
| Predicts 'Stuff' Classes (e.g., sky, road) | No (typically) | No | Yes | Yes |
| Key Challenge | Inferring plausible geometry for unseen regions. | Separating adjacent objects of the same class. | Achieving fine-grained class boundaries at pixel level. | Resolving conflicts between thing and stuff labels; panoptic quality metric. |
| Common Evaluation Metric | Amodal Mask Accuracy (AMA), Intersection-over-Union on full shape. | Average Precision (AP) based on mask IoU. | Mean Intersection-over-Union (mIoU). | Panoptic Quality (PQ), which combines segmentation and recognition quality. |
| Foundation Model Example | Specialized extensions of models like SAM for amodal inference. | Mask R-CNN, YOLACT, QueryInst. | FCN, U-Net, DeepLab. | Panoptic FPN, MaskFormer, K-Net. |
Frequently Asked Questions
Amodal segmentation is a critical computer vision task for predicting the complete shape of objects, including occluded parts, enabling robust scene understanding for robotics and autonomous systems.
Amodal segmentation is the computer vision task of predicting the complete shape and extent of an object, including the portions that are occluded or otherwise not visible in the given image. Unlike standard instance segmentation, which only segments the visible pixels, amodal segmentation infers the full object mask, reasoning about what lies behind other objects. This is essential for applications requiring a complete understanding of scene geometry and object interactions, such as robotic manipulation, autonomous navigation, and advanced augmented reality.
Key differentiators from related tasks:
- Instance Segmentation: Segments only the visible portion of each object instance.
- Semantic Segmentation: Classifies every pixel with a semantic label but does not distinguish instances or infer occluded areas.
- Panoptic Segmentation: Unifies semantic and instance segmentation but still operates on visible pixels only.
- Amodal Segmentation: Predicts the full object mask, including the occluded region, which is often represented as a separate mask or a distinct visual channel.
Related Terms
Amodal segmentation is a core task within visual grounding, requiring models to infer complete object shapes from partial observations. These related concepts define the broader technical landscape of scene understanding.
Occlusion Reasoning
Occlusion reasoning is the broader cognitive process by which a vision system infers the presence, shape, or properties of objects that are partially or fully hidden by others. It is the foundational capability that amodal segmentation aims to automate.
- Core Challenge: Distinguishing between object boundaries and occlusion boundaries.
- Approaches: Include geometric reasoning, depth ordering, and learning statistical priors about object shapes from large datasets.
- Application: Critical for robotics manipulation (grasping occluded objects), autonomous navigation (predicting hidden obstacles), and augmented reality (realistic object insertion).
Instance Segmentation
Instance segmentation is the task of detecting and delineating each distinct object of interest in an image, assigning a unique mask to each visible instance. It is the modal (visible-only) counterpart to amodal segmentation.
- Key Difference: Generates masks only for visible pixels, treating occluded areas as background.
- Common Architectures: Mask R-CNN, YOLACT, and query-based models like Mask2Former.
- Output: A set of masks and class labels for each visible object instance. Amodal segmentation builds upon this by predicting the complete shape, including the occluded volume.
Panoptic Segmentation
Panoptic segmentation is a unified task that requires classifying every pixel in an image with both a semantic label (e.g., 'road', 'sky') and, for 'thing' categories, a unique instance ID. It unifies semantic and instance segmentation but is typically modal.
- Two Types of Regions: 'Stuff' (amorphous, uncountable areas) and 'Things' (countable objects).
- Amodal Extension: Some research explores amodal panoptic segmentation, which aims to provide complete instance masks for all 'Things' while maintaining the 'Stuff' labeling, creating a holistic, complete scene understanding.
Scene Graph Generation
Scene Graph Generation parses an image into a structured graph representation where nodes are detected objects and edges are their pairwise relationships (e.g., 'riding', 'next to') or attributes. Understanding occlusion is often implicit for accurate relationship prediction.
- Structured Output: Represents a scene as (subject, predicate, object) triplets.
- Connection to Amodal: Knowing an object's full amodal extent can resolve ambiguous relationships (e.g., 'person behind table' vs. 'person next to table').
- Use Case: Enables complex visual reasoning, querying ('find the person wearing a hat'), and image generation from descriptive graphs.
Compositional Generalization
Compositional generalization is the ability of a model to understand and combine known concepts (objects, attributes, relations) in novel, unseen ways. For amodal segmentation, this means correctly segmenting an object in a novel occlusion context not seen during training.
- Core Problem: Models often fail when test-time object compositions or occlusion patterns differ from the training distribution.
- Testing for Amodal: Requires benchmarks with systematic splits where occluder/occludee pairs are separated between train and test sets.
- Solution Direction: Encouraging models to learn disentangled representations of shape, texture, and context.
Visual Commonsense Reasoning
Visual Commonsense Reasoning is the task of answering questions about an image that require understanding of implicit, real-world knowledge and physical laws beyond what is directly depicted. Amodal perception is a form of physical commonsense.
- Example Question: 'Could the person in the image see the dog?' requires reasoning about occlusion and line-of-sight.
- Amodal as Foundation: Inferring complete object shapes relies on priors about object continuity, solidity, and typical geometry—all facets of visual commonsense.
- Benchmarks: Datasets like VCR (Visual Commonsense Reasoning) and VQA-CP probe these capabilities.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.