Inferensys

Glossary

Task Graph

A task graph is a directed acyclic graph (DAG) that models a parallel program's computational workflow, where nodes represent tasks and edges denote data or control dependencies between them.
ML engineer managing model versions on laptop, version history visible, technical Git-like workflow.
PARALLEL COMPUTING

What is a Task Graph?

A task graph is the fundamental data structure used by parallel programming frameworks to represent and schedule computational workloads.

A task graph is a directed acyclic graph (DAG) that models a parallel computation, where nodes represent individual units of work (tasks) and directed edges represent dependencies—data flow or execution order—between those tasks. This abstraction allows a runtime scheduler to automatically determine which tasks are ready for concurrent execution, enabling efficient exploitation of parallelism across multi-core CPUs, GPUs, or specialized NPUs. The graph's structure inherently prevents deadlock by being acyclic.

In neural network compilation for NPUs, the computational graph of operators (e.g., convolutions, matrix multiplications) is transformed into a highly optimized task graph. The scheduler's goal is to minimize the critical path—the longest sequence of dependent tasks—while respecting hardware constraints like memory bandwidth and core count. Advanced techniques like kernel fusion are often represented as merging nodes in this graph to reduce costly intermediate data transfers, a key optimization for accelerator performance.

ARCHITECTURAL ELEMENTS

Key Components of a Task Graph

A task graph is a directed acyclic graph (DAG) that models a parallel computation. Its structure defines the execution workflow, dependencies, and scheduling constraints for efficient deployment on hardware accelerators like NPUs and GPUs.

01

Nodes (Tasks)

A node represents an individual unit of computation or work within the graph. Each node is a task, which can be a function, a kernel (e.g., a matrix multiplication), or a data transformation. In neural network execution, a node often corresponds to a layer or a fused group of operations. Nodes are characterized by their computational cost, resource requirements (e.g., memory, registers), and the data they produce or consume.

02

Edges (Dependencies)

A directed edge between two nodes represents a data dependency or a control dependency. It defines a producer-consumer relationship, where the output of the source task is required as input by the destination task. The direction of the edge imposes a partial ordering, ensuring tasks execute only after their predecessors have completed. Edges may be annotated with the size and type of data being transferred, which is critical for scheduling and memory allocation.

03

Directed Acyclic Graph (DAG) Structure

The acyclic property is fundamental: there are no cycles or loops in the graph. This guarantees that the computation can complete, as there is no circular wait condition. The DAG structure enables:

  • Static analysis for scheduling and resource allocation.
  • Identification of the critical path (the longest path through the graph), which determines the minimum possible execution time.
  • Safe parallelization, as dependencies are explicit and non-circular.
04

Task Granularity

Granularity refers to the size or computational weight of individual tasks. It is a key trade-off in graph design:

  • Fine-grained tasks are small, numerous, and offer high parallelism but incur significant scheduling and synchronization overhead.
  • Coarse-grained tasks are larger, reduce overhead, but may limit parallelism and load balancing. Optimizers often perform kernel fusion—merging multiple fine-grained operations into a single coarse-grained node—to minimize data movement and launch overhead on accelerators.
05

Dataflow Semantics

Task graphs implement a dataflow programming model, where execution is triggered by the availability of data. The semantics govern how data is passed:

  • Value-based dataflow: Edges transfer actual data tensors or values.
  • Token-based dataflow: Edges transfer control tokens or futures, with data fetched lazily. This model decouples the specification of what to compute from when and where to compute it, enabling dynamic, runtime schedulers to map tasks to available processors.
06

Scheduling & Execution Model

While not part of the static graph definition, the execution model defines how the graph is realized at runtime. Key concepts include:

  • Dynamic Schedulers: Systems that map ready tasks (those with all dependencies satisfied) to worker threads or cores.
  • Work Stealing: A load-balancing technique where idle workers steal tasks from the queues of busy workers.
  • Barrier Synchronization: Points, often implicit at graph completion, where all tasks must finish before proceeding. The graph provides the blueprint; the scheduler orchestrates the live execution on parallel hardware.
PARALLELISM AND SCHEDULING

How Task Graphs Enable Parallel Scheduling

A task graph is the fundamental data structure used by modern compilers and runtime systems to extract parallelism from computational workloads, particularly for neural network execution on hardware accelerators like NPUs.

A task graph is a directed acyclic graph (DAG) that formally represents a parallel computation, where nodes are computational tasks (e.g., kernel executions) and directed edges denote data dependencies or execution ordering constraints between them. This explicit representation of dependencies is what allows a parallel scheduler to identify which tasks are independent and can be executed concurrently on available hardware resources, such as NPU cores or threads, while respecting the program's semantic correctness.

The scheduler analyzes the graph to find the critical path—the longest sequence of dependent tasks—which determines the minimum possible execution time. It then dynamically maps ready tasks to execution units, aiming to maximize occupancy and hide memory latency. Advanced techniques like work stealing dynamically balance load. For NPUs, the compiler first constructs a high-level computational graph from a model (e.g., a neural network), then applies hardware-specific optimizations like kernel fusion before generating the final, scheduled task graph for execution.

PARALLELISM AND SCHEDULING

Task Graph Examples in AI & ML Systems

A task graph is a directed acyclic graph (DAG) that models a computational workflow, where nodes are tasks and edges are dependencies. Below are concrete examples of how they are used to orchestrate parallelism in modern AI systems.

01

Neural Network Training Pipeline

In distributed training, a task graph orchestrates the data-parallel and model-parallel stages. Key tasks include:

  • Data Loading & Augmentation: Parallel tasks fetch and preprocess mini-batches.
  • Forward Pass: Computations flow layer-by-layer; dependencies enforce correct tensor shapes.
  • Loss Calculation & Backward Pass: The autograd system dynamically builds a computation graph where each tensor operation becomes a node. The backward pass traverses this graph in reverse to compute gradients.
  • Gradient Synchronization: An All-Reduce operation (e.g., using NCCL) is a collective task with dependencies on all local gradient calculations.
  • Optimizer Step: The weight update task depends on the synchronized gradients. Frameworks like PyTorch and TensorFlow implicitly or explicitly manage these graphs to schedule work across GPUs or NPUs.
02

Inference Serving with Dynamic Batching

For low-latency model serving, systems like TensorRT, Triton Inference Server, or custom NPU runtimes use task graphs to manage continuous batching.

  • Request Ingestion: Incoming queries become individual sub-graphs.
  • Graph Fusion & Kernel Launch: The scheduler analyzes independent operations (e.g., matrix multiplications, activations) across a batch, fusing them into optimized kernels. Dependencies ensure fused kernels execute in the correct sequence.
  • Memory Lifecycle Management: Tasks for allocating input buffers, executing the graph, and deallocating output buffers are explicitly modeled to prevent memory leaks and overlap computation with data transfer. This graph-based approach allows the scheduler to maximize hardware occupancy by packing operations from multiple requests into a single, efficient execution stream.
03

Compiler Graph Lowering for NPUs

AI compilers (e.g., Apache TVM, MLIR, vendor SDKs) use multi-level task graphs to transform a high-level model into optimized NPU code.

  1. High-Level Graph: The initial graph from a framework (ONNX, PyTorch IR) with abstract operations.
  2. Operator Fusion Graph: The compiler applies kernel fusion rules, merging nodes like convolution, batch norm, and ReLU into a single compound task to minimize intermediate memory writes.
  3. Scheduled Graph: Tasks are assigned to specific hardware units (e.g., tensor cores, vector units). Dependencies model data movement between on-chip SRAM, DRAM, and compute units.
  4. Instruction Graph: The final graph represents low-level, scheduled instructions (microtasks) for the NPU's execution units. This graph is serialized into an executable command stream. This process is the core of hardware-aware model optimization.
04

Multi-Stage Retrieval-Augmented Generation (RAG)

A production RAG pipeline is a classic task graph that coordinates heterogeneous subsystems.

  • Query Understanding & Rewriting: An initial LLM call to clarify the user intent.
  • Vector Search: A parallel task queries a vector database using the rewritten query's embedding. This task is independent of subsequent generation but must complete before it.
  • Re-Ranking & Synthesis: A separate task may re-rank retrieved chunks using a cross-encoder. The top-k chunks become inputs.
  • Contextual Generation: The final LLM call task has a strict dependency on the retrieved context. The graph ensures the context is fully prepared before generation begins.
  • Citation & Logging: Post-generation tasks format the output and log telemetry. Modeling this as a graph allows for async execution of non-critical path tasks (like logging) and easy insertion of caching nodes.
05

Autonomous Agent Planning & Execution

In agentic architectures, a task graph dynamically expands to represent a plan.

  • Goal Decomposition: A planning LLM breaks a high-level objective ("analyze Q3 sales report") into a graph of subtasks: [fetch_data → clean_data → run_analysis → generate_summary].
  • Tool Calling: Each subtask may require executing a tool (API, database query, code execution). These tool calls become nodes with dependencies on their inputs.
  • Conditional & Loop Execution: The graph can include control-flow nodes. For example, an IF node based on the result of run_analysis might add a new investigate_anomaly task branch.
  • Error Handling & Retry: Failed tasks can trigger recursive error correction sub-graphs. The system monitors the graph's critical path for timeouts or failures. Frameworks like LangGraph explicitly use this paradigm to manage state and execution flow across multi-step agentic workflows.
06

Data Preprocessing & Feature Engineering DAGs

In large-scale ML pipelines, tools like Apache Airflow, Kubeflow Pipelines, and Meta's FBLearner use task graphs to manage data workflows.

  • Extract, Transform, Load (ETL): Each data source has a dedicated extraction task. Transformation tasks (normalization, feature encoding) depend on specific extractions completing.
  • Parallel Feature Computation: Independent feature calculations (e.g., computing rolling averages vs. one-hot encoding) are modeled as parallel nodes, significantly reducing wall-clock time.
  • Validation & Joining: A task validates data quality before a final joining task merges all feature sets into a training dataset.
  • Triggering Downstream Jobs: The final node in the graph often triggers the model training task graph. This clear dependency model ensures data is fresh and valid before costly training cycles begin on GPU/TPU/NPU clusters.
COMPARISON

Task Graphs vs. Other Parallelism Models

A feature comparison of the task graph model against other fundamental parallel computing paradigms, highlighting differences in abstraction, scheduling, and data dependency handling.

Feature / CharacteristicTask Graph (DAG)Data ParallelismTask ParallelismPipeline Parallelism

Primary Abstraction

Directed Acyclic Graph (DAG) of tasks

Data partitions / batches

Independent functions / procedures

Sequential stages with buffers

Dependency Model

Explicit edges (data/control)

Implicit (same op on all data)

Implicit or coarse-grained (independent tasks)

Explicit stage-to-stage ordering

Scheduling Granularity

Dynamic or static graph traversal

Fixed data distribution

Coarse-grained task assignment

Fixed pipeline stage assignment

Load Balancing

Work stealing, dynamic scheduling

Static or semi-static data split

Challenging for heterogeneous tasks

Buffer sizing to balance stage throughput

Synchronization Overhead

Dependency satisfaction only

Global barrier per iteration

Barriers or task joins

Inter-stage communication latency

Memory Consistency

Dataflow semantics (producer-consumer)

Bulk synchronous parallel (BSP)

Depends on implementation (e.g., locks)

Streaming with bounded buffers

Optimal Use Case

Irregular, dependency-heavy workflows (e.g., compilation, sparse linear algebra)

Regular, batch-oriented computations (e.g., CNN training, matrix ops)

Embarrassingly parallel, heterogeneous functions

Streaming applications with sequential stages (e.g., video processing, LLM inference)

Hardware Mapping

Runtime maps tasks to available cores

Data mapped to parallel units (e.g., GPU threads)

Tasks mapped to processors/threads

Stages mapped to dedicated processors

TASK GRAPH

Frequently Asked Questions

A task graph is a fundamental data structure in parallel computing that models a program's workflow. This FAQ addresses common questions about its definition, construction, optimization, and role in modern hardware acceleration.

A task graph is a directed acyclic graph (DAG) that represents the computational workflow of a parallel program, where nodes are tasks (units of work) and edges denote data dependencies or control dependencies between them. It works by providing an abstract model for a scheduler or runtime system to analyze. The scheduler identifies tasks with no unresolved dependencies (ready tasks) and dispatches them to available processing units, respecting the partial order defined by the graph to ensure correct execution. This model is crucial for exploiting parallelism while managing complex dependencies in systems like machine learning compilers (e.g., TensorFlow XLA, PyTorch Inductor) and runtime engines.

Prasad Kumkar

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.