A task graph is a directed acyclic graph (DAG) that models a parallel computation, where nodes represent individual units of work (tasks) and directed edges represent dependencies—data flow or execution order—between those tasks. This abstraction allows a runtime scheduler to automatically determine which tasks are ready for concurrent execution, enabling efficient exploitation of parallelism across multi-core CPUs, GPUs, or specialized NPUs. The graph's structure inherently prevents deadlock by being acyclic.
Glossary
Task Graph

What is a Task Graph?
A task graph is the fundamental data structure used by parallel programming frameworks to represent and schedule computational workloads.
In neural network compilation for NPUs, the computational graph of operators (e.g., convolutions, matrix multiplications) is transformed into a highly optimized task graph. The scheduler's goal is to minimize the critical path—the longest sequence of dependent tasks—while respecting hardware constraints like memory bandwidth and core count. Advanced techniques like kernel fusion are often represented as merging nodes in this graph to reduce costly intermediate data transfers, a key optimization for accelerator performance.
Key Components of a Task Graph
A task graph is a directed acyclic graph (DAG) that models a parallel computation. Its structure defines the execution workflow, dependencies, and scheduling constraints for efficient deployment on hardware accelerators like NPUs and GPUs.
Nodes (Tasks)
A node represents an individual unit of computation or work within the graph. Each node is a task, which can be a function, a kernel (e.g., a matrix multiplication), or a data transformation. In neural network execution, a node often corresponds to a layer or a fused group of operations. Nodes are characterized by their computational cost, resource requirements (e.g., memory, registers), and the data they produce or consume.
Edges (Dependencies)
A directed edge between two nodes represents a data dependency or a control dependency. It defines a producer-consumer relationship, where the output of the source task is required as input by the destination task. The direction of the edge imposes a partial ordering, ensuring tasks execute only after their predecessors have completed. Edges may be annotated with the size and type of data being transferred, which is critical for scheduling and memory allocation.
Directed Acyclic Graph (DAG) Structure
The acyclic property is fundamental: there are no cycles or loops in the graph. This guarantees that the computation can complete, as there is no circular wait condition. The DAG structure enables:
- Static analysis for scheduling and resource allocation.
- Identification of the critical path (the longest path through the graph), which determines the minimum possible execution time.
- Safe parallelization, as dependencies are explicit and non-circular.
Task Granularity
Granularity refers to the size or computational weight of individual tasks. It is a key trade-off in graph design:
- Fine-grained tasks are small, numerous, and offer high parallelism but incur significant scheduling and synchronization overhead.
- Coarse-grained tasks are larger, reduce overhead, but may limit parallelism and load balancing. Optimizers often perform kernel fusion—merging multiple fine-grained operations into a single coarse-grained node—to minimize data movement and launch overhead on accelerators.
Dataflow Semantics
Task graphs implement a dataflow programming model, where execution is triggered by the availability of data. The semantics govern how data is passed:
- Value-based dataflow: Edges transfer actual data tensors or values.
- Token-based dataflow: Edges transfer control tokens or futures, with data fetched lazily. This model decouples the specification of what to compute from when and where to compute it, enabling dynamic, runtime schedulers to map tasks to available processors.
Scheduling & Execution Model
While not part of the static graph definition, the execution model defines how the graph is realized at runtime. Key concepts include:
- Dynamic Schedulers: Systems that map ready tasks (those with all dependencies satisfied) to worker threads or cores.
- Work Stealing: A load-balancing technique where idle workers steal tasks from the queues of busy workers.
- Barrier Synchronization: Points, often implicit at graph completion, where all tasks must finish before proceeding. The graph provides the blueprint; the scheduler orchestrates the live execution on parallel hardware.
How Task Graphs Enable Parallel Scheduling
A task graph is the fundamental data structure used by modern compilers and runtime systems to extract parallelism from computational workloads, particularly for neural network execution on hardware accelerators like NPUs.
A task graph is a directed acyclic graph (DAG) that formally represents a parallel computation, where nodes are computational tasks (e.g., kernel executions) and directed edges denote data dependencies or execution ordering constraints between them. This explicit representation of dependencies is what allows a parallel scheduler to identify which tasks are independent and can be executed concurrently on available hardware resources, such as NPU cores or threads, while respecting the program's semantic correctness.
The scheduler analyzes the graph to find the critical path—the longest sequence of dependent tasks—which determines the minimum possible execution time. It then dynamically maps ready tasks to execution units, aiming to maximize occupancy and hide memory latency. Advanced techniques like work stealing dynamically balance load. For NPUs, the compiler first constructs a high-level computational graph from a model (e.g., a neural network), then applies hardware-specific optimizations like kernel fusion before generating the final, scheduled task graph for execution.
Task Graph Examples in AI & ML Systems
A task graph is a directed acyclic graph (DAG) that models a computational workflow, where nodes are tasks and edges are dependencies. Below are concrete examples of how they are used to orchestrate parallelism in modern AI systems.
Neural Network Training Pipeline
In distributed training, a task graph orchestrates the data-parallel and model-parallel stages. Key tasks include:
- Data Loading & Augmentation: Parallel tasks fetch and preprocess mini-batches.
- Forward Pass: Computations flow layer-by-layer; dependencies enforce correct tensor shapes.
- Loss Calculation & Backward Pass: The autograd system dynamically builds a computation graph where each tensor operation becomes a node. The backward pass traverses this graph in reverse to compute gradients.
- Gradient Synchronization: An All-Reduce operation (e.g., using NCCL) is a collective task with dependencies on all local gradient calculations.
- Optimizer Step: The weight update task depends on the synchronized gradients. Frameworks like PyTorch and TensorFlow implicitly or explicitly manage these graphs to schedule work across GPUs or NPUs.
Inference Serving with Dynamic Batching
For low-latency model serving, systems like TensorRT, Triton Inference Server, or custom NPU runtimes use task graphs to manage continuous batching.
- Request Ingestion: Incoming queries become individual sub-graphs.
- Graph Fusion & Kernel Launch: The scheduler analyzes independent operations (e.g., matrix multiplications, activations) across a batch, fusing them into optimized kernels. Dependencies ensure fused kernels execute in the correct sequence.
- Memory Lifecycle Management: Tasks for allocating input buffers, executing the graph, and deallocating output buffers are explicitly modeled to prevent memory leaks and overlap computation with data transfer. This graph-based approach allows the scheduler to maximize hardware occupancy by packing operations from multiple requests into a single, efficient execution stream.
Compiler Graph Lowering for NPUs
AI compilers (e.g., Apache TVM, MLIR, vendor SDKs) use multi-level task graphs to transform a high-level model into optimized NPU code.
- High-Level Graph: The initial graph from a framework (ONNX, PyTorch IR) with abstract operations.
- Operator Fusion Graph: The compiler applies kernel fusion rules, merging nodes like convolution, batch norm, and ReLU into a single compound task to minimize intermediate memory writes.
- Scheduled Graph: Tasks are assigned to specific hardware units (e.g., tensor cores, vector units). Dependencies model data movement between on-chip SRAM, DRAM, and compute units.
- Instruction Graph: The final graph represents low-level, scheduled instructions (microtasks) for the NPU's execution units. This graph is serialized into an executable command stream. This process is the core of hardware-aware model optimization.
Multi-Stage Retrieval-Augmented Generation (RAG)
A production RAG pipeline is a classic task graph that coordinates heterogeneous subsystems.
- Query Understanding & Rewriting: An initial LLM call to clarify the user intent.
- Vector Search: A parallel task queries a vector database using the rewritten query's embedding. This task is independent of subsequent generation but must complete before it.
- Re-Ranking & Synthesis: A separate task may re-rank retrieved chunks using a cross-encoder. The top-k chunks become inputs.
- Contextual Generation: The final LLM call task has a strict dependency on the retrieved context. The graph ensures the context is fully prepared before generation begins.
- Citation & Logging: Post-generation tasks format the output and log telemetry. Modeling this as a graph allows for async execution of non-critical path tasks (like logging) and easy insertion of caching nodes.
Autonomous Agent Planning & Execution
In agentic architectures, a task graph dynamically expands to represent a plan.
- Goal Decomposition: A planning LLM breaks a high-level objective ("analyze Q3 sales report") into a graph of subtasks:
[fetch_data → clean_data → run_analysis → generate_summary]. - Tool Calling: Each subtask may require executing a tool (API, database query, code execution). These tool calls become nodes with dependencies on their inputs.
- Conditional & Loop Execution: The graph can include control-flow nodes. For example, an
IFnode based on the result ofrun_analysismight add a newinvestigate_anomalytask branch. - Error Handling & Retry: Failed tasks can trigger recursive error correction sub-graphs. The system monitors the graph's critical path for timeouts or failures. Frameworks like LangGraph explicitly use this paradigm to manage state and execution flow across multi-step agentic workflows.
Data Preprocessing & Feature Engineering DAGs
In large-scale ML pipelines, tools like Apache Airflow, Kubeflow Pipelines, and Meta's FBLearner use task graphs to manage data workflows.
- Extract, Transform, Load (ETL): Each data source has a dedicated extraction task. Transformation tasks (normalization, feature encoding) depend on specific extractions completing.
- Parallel Feature Computation: Independent feature calculations (e.g., computing rolling averages vs. one-hot encoding) are modeled as parallel nodes, significantly reducing wall-clock time.
- Validation & Joining: A task validates data quality before a final joining task merges all feature sets into a training dataset.
- Triggering Downstream Jobs: The final node in the graph often triggers the model training task graph. This clear dependency model ensures data is fresh and valid before costly training cycles begin on GPU/TPU/NPU clusters.
Task Graphs vs. Other Parallelism Models
A feature comparison of the task graph model against other fundamental parallel computing paradigms, highlighting differences in abstraction, scheduling, and data dependency handling.
| Feature / Characteristic | Task Graph (DAG) | Data Parallelism | Task Parallelism | Pipeline Parallelism |
|---|---|---|---|---|
Primary Abstraction | Directed Acyclic Graph (DAG) of tasks | Data partitions / batches | Independent functions / procedures | Sequential stages with buffers |
Dependency Model | Explicit edges (data/control) | Implicit (same op on all data) | Implicit or coarse-grained (independent tasks) | Explicit stage-to-stage ordering |
Scheduling Granularity | Dynamic or static graph traversal | Fixed data distribution | Coarse-grained task assignment | Fixed pipeline stage assignment |
Load Balancing | Work stealing, dynamic scheduling | Static or semi-static data split | Challenging for heterogeneous tasks | Buffer sizing to balance stage throughput |
Synchronization Overhead | Dependency satisfaction only | Global barrier per iteration | Barriers or task joins | Inter-stage communication latency |
Memory Consistency | Dataflow semantics (producer-consumer) | Bulk synchronous parallel (BSP) | Depends on implementation (e.g., locks) | Streaming with bounded buffers |
Optimal Use Case | Irregular, dependency-heavy workflows (e.g., compilation, sparse linear algebra) | Regular, batch-oriented computations (e.g., CNN training, matrix ops) | Embarrassingly parallel, heterogeneous functions | Streaming applications with sequential stages (e.g., video processing, LLM inference) |
Hardware Mapping | Runtime maps tasks to available cores | Data mapped to parallel units (e.g., GPU threads) | Tasks mapped to processors/threads | Stages mapped to dedicated processors |
Frequently Asked Questions
A task graph is a fundamental data structure in parallel computing that models a program's workflow. This FAQ addresses common questions about its definition, construction, optimization, and role in modern hardware acceleration.
A task graph is a directed acyclic graph (DAG) that represents the computational workflow of a parallel program, where nodes are tasks (units of work) and edges denote data dependencies or control dependencies between them. It works by providing an abstract model for a scheduler or runtime system to analyze. The scheduler identifies tasks with no unresolved dependencies (ready tasks) and dispatches them to available processing units, respecting the partial order defined by the graph to ensure correct execution. This model is crucial for exploiting parallelism while managing complex dependencies in systems like machine learning compilers (e.g., TensorFlow XLA, PyTorch Inductor) and runtime engines.
Enabling Efficiency, Speed & Accuracy
Intelligent Analysis, Decision & Execution
We build AI systems for teams that need search across company data, workflow automation across tools, or AI features inside products and internal software.
Talk to Us
Search across company data
Give teams answers from docs, tickets, runbooks, and product data with sources and permissions.
Useful when people spend too long searching or get different answers from different systems.

Automate internal workflows
Use AI to route work, draft outputs, trigger actions, and keep approvals and logs in place.
Useful when repetitive work moves across multiple tools and teams.

Add AI to products and internal tools
Build assistants, guided actions, or decision support into the software your team or customers already use.
Useful when AI needs to be part of the product, not a separate tool.
Related Terms
A task graph is a foundational abstraction for parallel execution. These related concepts define the scheduling, execution, and hardware models that bring a task graph to life on modern accelerators.
Critical Path
The critical path is the longest sequence of dependent tasks in a task graph from the start node to the finish node. It determines the minimum possible execution time (makespan) for the entire parallel computation. Optimizing a task graph often involves identifying and shortening its critical path through techniques like task fusion or asynchronous execution of non-critical dependencies.
Work Stealing
Work stealing is a dynamic, decentralized load-balancing scheduling algorithm used to execute task graphs efficiently. Idle processors (or threads) 'steal' tasks from the queues of busy processors. This is highly effective for irregular or imbalanced task graphs where task costs are unknown at compile time, as it adaptively redistributes work without centralized coordination. Frameworks like Intel's Threading Building Blocks (TBB) and C++17's parallel algorithms often use this strategy.
Data Race
A data race is a concurrency bug that can occur during the parallel execution of a task graph. It happens when two or more tasks access the same memory location concurrently, at least one access is a write, and the accesses are not ordered by explicit dependencies or synchronization. Correct task graph construction must define data dependencies (edges) to prevent races, ensuring deterministic execution. Tools like ThreadSanitizer are used to detect such violations.
SIMT (Single Instruction, Multiple Threads)
SIMT is the execution model used by modern GPUs and many NPUs. It extends SIMD by issuing a single instruction to a warp or wavefront of threads, each executing it on its own data. This model is crucial for understanding how fine-grained parallel tasks (like individual neurons in a layer) are mapped to hardware. Task graphs for these architectures must be compiled into kernels that efficiently utilize the SIMT paradigm, managing thread divergence and memory coalescing.
Barrier Synchronization
A barrier is a synchronization primitive that forces all tasks in a parallel group to reach a specific point before any can proceed. In a task graph, barriers are often implicit at the join of multiple dependent tasks. They are essential for correctness in phases of computation (e.g., between forward and backward passes in training) but can create performance bottlenecks if they cause processors to idle. Asynchronous execution and double-buffering are techniques to hide this latency.
Amdahl's Law
Amdahl's Law is a fundamental formula that predicts the maximum theoretical speedup of a parallel program: Speedup = 1 / (S + P/N), where S is the serial fraction, P is the parallelizable fraction, and N is the number of processors. It highlights that the serial sections of a task graph (nodes on the critical path with no parallelism) ultimately limit performance gains. This law drives the compiler optimization goal of minimizing serialization and maximizing task parallelism within graphs.

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
Partnered with leading AI, data, and software stack.
How We Work
Custom AI workflows for your Business
One-fit-all AI don't work for modern businesses. At Inferensys, we aim to understand your business & custom requirements; which we use to define most efficient agentic workflows, the data, and the tools for your business.
01
Review the use case
We understand the task, the users, and where AI can actually help.
Read more02
Pick the right approach
We define what needs search, automation, or product integration.
Read more03
Build the first useful version
We implement the part that proves the value first.
Read more04
Improve from there
We add the checks and visibility needed to keep it useful.
Read moreThe first call is a practical review of your use case and the right next step.
Talk to Us