Profiling (CPU/GPU)

Profiling is the systematic measurement of a program's execution to identify performance bottlenecks, utilizing tools to analyze time spent on CPU operations, GPU kernels, and memory transfers.

LATENCY BENCHMARKING

What is Profiling (CPU/GPU)?

Profiling (CPU/GPU) is the systematic, low-level measurement of a program's execution to identify performance bottlenecks and quantify resource utilization. A profiler instruments or samples running code to collect granular timing data on CPU operations, GPU kernel execution, and memory transfers. Tools such as PyTorch Profiler and NVIDIA Nsight present this data as flame graphs and timelines, so engineers can pinpoint the lines of code, functions, or hardware operations that cause latency. Profiling differs from high-level monitoring in that it supplies the causal detail needed to identify a bottleneck and optimize it.

For AI inference, profiling is critical for latency reduction. It reveals inefficiencies in the model execution graph, excessive GPU kernel launch overhead, or suboptimal memory access patterns. PyTorch Profiler, NVIDIA Nsight Systems, and the per-layer profiling built into TensorRT capture these metrics. The output guides optimizations such as operator fusion, model quantization, and improved batching strategies. Effective profiling replaces qualitative guesses about performance with quantitative, actionable data, forming the empirical foundation for meeting strict Service Level Objectives (SLOs) for latency.

PROFILING (CPU/GPU)

Key Metrics and Events Captured by Profilers

Profilers instrument code execution to capture granular performance data, enabling engineers to identify the precise computational and memory operations causing latency bottlenecks.

CPU Time & Wall Clock Time

CPU Time measures the total time the CPU spends actively executing instructions for a specific function or operation, excluding time spent waiting for I/O or other processes. Wall Clock Time (or elapsed time) measures the total real-world time from start to finish, including all waits and overheads. A wall clock time well above CPU time points to I/O bottlenecks or contention for system resources. In multithreaded code the relationship can invert, because CPU time summed across threads can exceed wall clock time.

Example: A data preprocessing function may show low CPU time but high wall clock time, indicating it's stalled waiting for disk reads or network calls.
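
The gap between the two clocks can be observed with Python's standard library alone. The sketch below is illustrative: simulated_preprocess is a hypothetical stand-in for a real preprocessing step, and the sleep only imitates a blocking disk or network wait.

```python
import time

def simulated_preprocess(wait_seconds: float = 0.2) -> int:
    """Hypothetical preprocessing step: a blocking wait plus a little compute."""
    time.sleep(wait_seconds)                  # imitates waiting on disk or network
    return sum(i * i for i in range(50_000))  # small amount of real CPU work

wall_start = time.perf_counter()  # wall clock: includes time spent waiting
cpu_start = time.process_time()   # CPU time of this process: excludes the sleep

simulated_preprocess()

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall clock: {wall_elapsed:.3f} s")
print(f"CPU time:   {cpu_elapsed:.3f} s")
print(f"waiting:    {wall_elapsed - cpu_elapsed:.3f} s")
```

Most of the wall clock time here is spent in the sleep, which is the pattern described above: low CPU time, high elapsed time.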

GPU Kernel Execution

Profilers track the launch and duration of GPU kernels—the parallel functions executed on the GPU. Metrics include:

Kernel Runtime: Time spent executing each CUDA or ROCm kernel.
Occupancy: The ratio of active warps (thread groups) to the maximum supported by the streaming multiprocessor (SM), indicating utilization efficiency.
Launch Overhead: Latency between kernel invocation on the CPU and execution start on the GPU. Inefficient, small kernels can be dominated by this overhead.

Tools like NVIDIA Nsight Systems or PyTorch Profiler visualize these timelines, showing which model layers trigger specific kernels.
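
The occupancy ratio defined above is simple arithmetic once the number of resident blocks is known. The following sketch assumes a hypothetical limit of 64 warps per SM and takes the number of resident blocks as an input; on real hardware both depend on the GPU architecture and on each kernel's register and shared memory usage, which a profiler reports.

```python
def occupancy(threads_per_block: int, resident_blocks_per_sm: int,
              max_warps_per_sm: int, warp_size: int = 32) -> float:
    """Ratio of active warps to the maximum an SM supports (illustrative only)."""
    # A partially filled warp still occupies a full warp slot, so round up.
    warps_per_block = -(-threads_per_block // warp_size)
    active_warps = warps_per_block * resident_blocks_per_sm
    return min(active_warps, max_warps_per_sm) / max_warps_per_sm

# Hypothetical kernel: 256 threads per block, 4 blocks resident per SM.
# 8 warps per block * 4 blocks = 32 active warps out of an assumed 64.
print(occupancy(256, 4, 64))
```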

Memory Operations & Transfers

Memory operations are a primary source of latency in GPU-accelerated workloads. Profilers measure:

Host-to-Device (H2D) & Device-to-Host (D2H) Transfers: Time spent moving data between CPU (host) and GPU (device) memory over the PCIe bus. Excessive transfers are a common bottleneck.
GPU Memory Allocation/Deallocation: Overhead from dynamic memory management APIs like cudaMalloc.
GPU Kernel Memory Accesses: Profilers can identify memory-bound kernels where execution time is limited by the speed of reading/writing to GPU global or shared memory, versus compute-bound kernels limited by arithmetic logic unit (ALU) throughput.
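
One common way to reason about the memory-bound versus compute-bound distinction is a roofline-style comparison of a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's ratio of peak compute to peak memory bandwidth. The sketch below is a simplified model, and the peak figures are hypothetical placeholders rather than the specification of any real GPU.

```python
def classify_kernel(flops: float, bytes_moved: float,
                    peak_flops_per_s: float, peak_bytes_per_s: float) -> str:
    """Classify a kernel with a simple roofline-style comparison."""
    intensity = flops / bytes_moved                         # FLOPs per byte
    machine_balance = peak_flops_per_s / peak_bytes_per_s   # FLOPs per byte
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

# Hypothetical hardware: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 2e12

# Elementwise float32 add: 1 FLOP per element, 12 bytes moved (2 reads, 1 write).
n_elements = 1_000_000
print(classify_kernel(n_elements, 12 * n_elements, PEAK_FLOPS, PEAK_BANDWIDTH))

# Square float32 matmul: 2*n^3 FLOPs, three n*n matrices of 4-byte values.
n = 4096
print(classify_kernel(2 * n**3, 3 * n * n * 4, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumed peaks the elementwise add is classified as memory-bound and the matmul as compute-bound. A profiler's measured throughput counters are what confirm or refute such an estimate.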

Operator-Level Breakdown

Deep learning profilers decompose the model's execution graph and time individual operators (ops). This reveals which layers consume the most inference time.

Attention vs. FFN Layers: In transformers, profile the time for multi-head attention versus feed-forward network blocks.
Convolution Parameters: For CNNs, profile by kernel size, stride, and input/output channels.
Framework Overhead: Time spent in the framework's Python-to-C++ dispatch layer versus the core compute kernel.

PyTorch Profiler reports time spent per operator, such as aten::matmul or aten::convolution, as aggregated tables and timeline traces that can be viewed in TensorBoard.
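
An operator-level breakdown is, at its core, an aggregation of per-call timings by operator name. The sketch below uses hand-written, hypothetical records (they are not output from any real profiler) to show the kind of ranking a deep learning profiler produces, including a comparison of CPU-side time against GPU kernel time.

```python
from collections import defaultdict

# Hypothetical per-call records: (operator, CPU-side time in us, GPU kernel time in us).
records = [
    ("aten::matmul", 30.0, 400.0),
    ("aten::softmax", 25.0, 40.0),
    ("aten::matmul", 30.0, 380.0),
    ("aten::add", 20.0, 5.0),
    ("aten::add", 20.0, 5.0),
]

def summarize(records):
    """Aggregate calls per operator and rank operators by total GPU time."""
    totals = defaultdict(lambda: {"calls": 0, "cpu_us": 0.0, "gpu_us": 0.0})
    for name, cpu_us, gpu_us in records:
        entry = totals[name]
        entry["calls"] += 1
        entry["cpu_us"] += cpu_us
        entry["gpu_us"] += gpu_us
    return sorted(totals.items(), key=lambda item: item[1]["gpu_us"], reverse=True)

summary = summarize(records)
for name, entry in summary:
    print(f"{name:15s} calls={entry['calls']} "
          f"cpu={entry['cpu_us']:.0f}us gpu={entry['gpu_us']:.0f}us")
```

In this made-up data, aten::add spends more time on the CPU side than in its kernel, which is the signature of an operator dominated by framework overhead rather than compute.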

CUDA API & Runtime Calls

Profilers intercept calls to the CUDA driver and runtime APIs to quantify their latency contribution.

Key traced calls include:

cudaMemcpyAsync: For asynchronous memory transfers.
cudaLaunchKernel: For launching GPU kernels.
cudaStreamSynchronize: Time spent waiting for a stream of operations to complete.
cudaMalloc / cudaFree: Dynamic memory management overhead.

Frequent, small API calls can create significant CPU-side overhead, limiting overall throughput. Profiling helps batch operations or adjust stream usage to minimize this.

System-Level Context

Advanced profilers correlate application metrics with system-wide resource utilization to identify external bottlenecks.

CPU Utilization: Per-core usage to detect single-threaded bottlenecks or excessive context switching.
GPU Utilization: The percentage of time the GPU's engines (graphics, compute, copy) are active.
PCIe Bandwidth: Saturation of the bus connecting CPU and GPU.
Power & Thermal Throttling: Events where the GPU or CPU reduces clock speed due to temperature or power limits, causing sudden latency spikes.

Tools like NVIDIA DCGM (Data Center GPU Manager) or Linux perf provide this system-level view alongside application traces.
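
A utilization percentage of this kind can be thought of as the fraction of a time window during which at least one operation was running. The sketch below computes that fraction from a list of busy intervals; the interval values are hypothetical, and real tools derive utilization from hardware counters rather than from a list like this.

```python
def busy_fraction(intervals, window_start, window_end):
    """Fraction of [window_start, window_end] covered by at least one interval."""
    busy = 0.0
    cursor = window_start  # everything before the cursor has been counted
    for start, end in sorted(intervals):
        start = max(start, cursor)   # skip any part already counted
        end = min(end, window_end)   # ignore anything past the window
        if end > start:
            busy += end - start
            cursor = end
    return busy / (window_end - window_start)

# Hypothetical kernel activity in milliseconds; the first two intervals overlap.
kernel_intervals = [(0, 10), (5, 20), (40, 50)]
utilization = busy_fraction(kernel_intervals, 0, 100)
print(f"GPU busy for {utilization:.0%} of the window")
```

Overlapping intervals are counted once, so concurrent kernels on different streams do not push the figure above 100%.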

How Does Profiling Work?

Profiling works by instrumenting or sampling code to collect granular timing data on CPU operations, GPU kernel execution, and memory transfers. Tools such as PyTorch Profiler and NVIDIA Nsight record this data, and views such as flame graphs and timelines reveal the functions, operators, or hardware components that consume the most time, enabling targeted optimization to reduce inference latency and improve throughput.

The process typically involves tracing the model execution graph to pinpoint inefficiencies such as excessive GPU kernel launch overhead, missed operator fusion opportunities, or memory bandwidth saturation. By comparing profiles taken before and after a change, engineers can validate the impact of optimizations like model quantization or continuous batching. Effective profiling is foundational for establishing a performance baseline, meeting Service Level Objectives (SLOs) for latency, and identifying bottlenecks in production AI systems.
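
The collect-then-rank workflow can be seen in miniature with cProfile and pstats from Python's standard library. This is a CPU-only sketch; pipeline, slow_step, and fast_step are hypothetical functions written for the example, and GPU profilers follow the same pattern with different collection mechanisms.

```python
import cProfile
import io
import pstats

def slow_step():
    return sum(i * i for i in range(200_000))

def fast_step():
    return sum(range(1_000))

def pipeline():
    slow_step()
    fast_step()

# Collect: record every function call made while the profiler is enabled.
profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Rank: sort by cumulative time so the most expensive call paths come first.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The report lists call counts alongside total and cumulative time per function, which is enough to see that slow_step dominates this toy pipeline.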

Common Profiling Tools and Frameworks

The following tools and frameworks are commonly used to analyze time spent on CPU operations, GPU kernels, and memory transfers in AI workloads.

PyTorch Profiler

The PyTorch Profiler is a native performance debugging tool integrated into the PyTorch framework. It provides a unified view of CPU and GPU execution, crucial for deep learning workloads.

Key Features: Tracks operator execution time, GPU kernel activity, memory usage per operator, and data transfer between CPU and GPU.
Output Formats: Generates Chrome Trace Viewer files for timeline visualization and can export to TensorBoard for aggregated statistics.
Use Case: Ideal for identifying slow operators in a model's forward/backward pass, detecting excessive CPU-GPU synchronization, and analyzing memory allocation patterns during training or inference.
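
Trace files in the Chrome trace format are JSON, so they can be post-processed with a few lines of code. The sketch below sums event durations by category over a small hand-written trace. The event names and the category labels (cpu_op, kernel, gpu_memcpy) are assumptions made for illustration, so check them against a real exported trace before relying on them.

```python
import json
from collections import Counter

# Hand-written sample in Chrome trace format; "ts" and "dur" are in microseconds.
trace_json = """
{"traceEvents": [
  {"ph": "X", "cat": "cpu_op", "name": "aten::linear", "ts": 0, "dur": 50},
  {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "ts": 20, "dur": 300},
  {"ph": "X", "cat": "gpu_memcpy", "name": "Memcpy HtoD", "ts": 330, "dur": 120},
  {"ph": "X", "cat": "kernel", "name": "relu_kernel", "ts": 460, "dur": 15},
  {"ph": "M", "name": "process_name", "args": {"name": "python"}}
]}
"""

def time_by_category(trace_text):
    """Sum the duration of complete events per category."""
    events = json.loads(trace_text)["traceEvents"]
    totals = Counter()
    for event in events:
        if event.get("ph") == "X":  # "X" marks a complete event with a duration
            totals[event.get("cat", "unknown")] += event["dur"]
    return dict(totals)

category_totals = time_by_category(trace_json)
for category, total_us in sorted(category_totals.items()):
    print(f"{category:12s} {total_us} us")
```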

NVIDIA Nsight Systems

NVIDIA Nsight Systems is a system-wide performance analysis tool designed for heterogeneous computing environments using NVIDIA GPUs. It provides a low-overhead timeline of all system activity.

Key Features: Captures GPU kernel execution, CUDA API calls, CPU threads, multi-process activity, and data transfers (PCIe, NVLink). It correlates CPU and GPU timelines to identify bottlenecks.
System-Level View: Unlike pure GPU profilers, it shows how CPU scheduling and driver interactions affect GPU utilization, which is critical for optimizing inference servers.
Use Case: Essential for diagnosing issues like low GPU utilization due to CPU bottlenecks, kernel launch overheads, and inefficient pipeline parallelism in complex applications.

TensorFlow Profiler

The TensorFlow Profiler is integrated into TensorBoard and provides detailed instrumentation for performance analysis of TensorFlow models across CPUs, GPUs, and TPUs.

Key Features: Offers an overview page for a high-level performance summary, a trace viewer for detailed timeline analysis, and tools for profiling input pipelines to detect data starvation.
TPU Support: Includes specialized profiling capabilities for Tensor Processing Units, analyzing TPU matrix multiplication unit (MXU) utilization.
Use Case: Used to find inefficient model architectures, optimize input data pipelines to prevent the GPU from idling, and debug performance issues specific to TensorFlow's execution graph.

Flame Graphs

A flame graph is a visualization of hierarchical, nested data, originally created for profiling stack traces. It is a language-agnostic tool for identifying the most frequent code paths consuming CPU resources.

Visualization: The x-axis shows the stack profile population, sorted alphabetically; it does not represent the passage of time. The y-axis shows stack depth. Each rectangle represents a stack frame, and its width shows how often that frame was present in the sampled stacks.
Creation Tools: Generated from profiling data collected by tools like perf (Linux), py-spy (for Python), or go tool pprof.
Use Case: Excellent for identifying 'hot' functions in inference serving code, Python preprocessing overhead, or framework internals that dominate execution time, providing a quick visual bottleneck summary.
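
Flame graph generators typically consume "folded" stacks: one line per unique call stack, with frames joined by semicolons from root to leaf, followed by a sample count. The sketch below folds a handful of hypothetical stack samples into that form; the function names are invented for the example.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse stack samples (root first) into sorted 'frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

# Hypothetical samples from an inference server, one list per sampled call stack.
samples = [
    ["serve", "preprocess", "decode_image"],
    ["serve", "model_forward", "matmul"],
    ["serve", "model_forward", "matmul"],
    ["serve", "model_forward", "softmax"],
    ["serve", "preprocess", "decode_image"],
    ["serve", "model_forward", "matmul"],
]

folded = fold_stacks(samples)
for line in folded:
    print(line)
```

Each count becomes the width of the corresponding rectangle, so the matmul stack, sampled three times out of six, would span half the graph.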

CUDA Profiling Tools Interface (CUPTI)

CUPTI is a low-level API that enables the creation of profiling and tracing tools for CUDA applications. It is the underlying engine for higher-level tools like Nsight Compute and the PyTorch Profiler's GPU metrics.

Low-Level Metrics: Provides direct access to hardware performance counters, including SM (Streaming Multiprocessor) utilization, memory throughput, warp execution efficiency, and instruction replay statistics.
Activity Tracing: Enables tracing of CUDA runtime and driver API calls, kernel executions, and memory transfers with precise timestamps.
Use Case: Used by tool developers and for deep, custom analysis of GPU kernel performance, such as diagnosing warp divergence, shared memory bank conflicts, or memory-bound versus compute-bound kernels.

ONNX Runtime Profiling

ONNX Runtime provides built-in profiling capabilities for models exported in the Open Neural Network Exchange (ONNX) format. It is critical for optimizing inference performance across diverse hardware backends.

Backend-Agnostic: Profiles execution regardless of the chosen Execution Provider (EP), such as CUDA, TensorRT, CPU, or OpenVINO.
Node-Level Timing: Breaks down inference time by individual ONNX graph node, showing which operators are the slowest in the optimized execution graph.
Use Case: Used to validate the performance impact of different EPs, identify suboptimal graph partitions in multi-device inference, and compare latency before and after applying graph optimizations like operator fusion.

Frequently Asked Questions

This FAQ addresses common questions about using profiling tools to analyze CPU operations, GPU kernels, and memory transfers when optimizing AI inference latency.

Profiling is the systematic measurement and analysis of a program's execution to identify performance bottlenecks in CPU, GPU, and memory subsystems. For AI inference, profiling is critical because it provides the empirical data needed to optimize end-to-end latency and throughput, moving beyond guesswork to data-driven performance tuning. It reveals whether delays are caused by inefficient GPU kernel execution, excessive CPU-GPU synchronization, memory bandwidth saturation, or framework overhead. Without profiling, optimization efforts are blind, often targeting the wrong component and wasting engineering resources. Tools like PyTorch Profiler, NVIDIA Nsight Systems, and TensorBoard generate detailed timelines and metrics that pinpoint the exact operations consuming the most time, enabling precise interventions such as operator fusion, kernel optimization, or batch size adjustment.

Related Terms

Profiling is a core technique within latency benchmarking. These related concepts detail the specific metrics, tools, and optimization targets that profiling activities aim to measure and improve.

Inference Latency

The total time delay between submitting an input to a machine learning model and receiving its output. Profiling decomposes this latency into constituent parts:

Prefill: Processing the full input prompt before the first output token is produced.
Decoding: Autoregressive token generation.
GPU Kernel Execution: Time spent in compute kernels.
Memory Transfers: Data movement between CPU and GPU (PCIe overhead).
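
As a rough illustration of how these parts add up for a text-generation request, the sketch below uses a deliberately simplified model: prefill produces the first token, every later token costs one decode step of constant duration, and transfers happen once up front. All timings are hypothetical.

```python
def latency_breakdown_ms(transfer_ms: float, prefill_ms: float,
                         decode_ms_per_token: float, output_tokens: int) -> dict:
    """Simplified end-to-end latency model for one generation request."""
    time_to_first_token = transfer_ms + prefill_ms
    # Prefill yields the first token, so the remaining tokens need decode steps.
    decode_total = decode_ms_per_token * max(output_tokens - 1, 0)
    total = time_to_first_token + decode_total
    return {
        "time_to_first_token_ms": time_to_first_token,
        "decode_ms": decode_total,
        "total_ms": total,
        "decode_share": decode_total / total,
    }

# Hypothetical request: 2 ms transfer, 120 ms prefill, 20 ms per decode step.
breakdown = latency_breakdown_ms(2.0, 120.0, 20.0, output_tokens=100)
for key, value in breakdown.items():
    print(f"{key}: {value:.3f}")
```

With these made-up numbers decoding accounts for most of the total, which is why a profile of the decode loop is often the first place to look for long generations.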

Bottleneck Identification

The primary goal of profiling. It involves analyzing execution traces to pinpoint the limiting resource or operation. Common bottlenecks include:

GPU Compute-Bound: Saturation of streaming multiprocessors (SMs).
Memory-Bound: Bandwidth limitations on GPU VRAM or system RAM.
Kernel Launch Overhead: Excessive latency from frequent, small kernel dispatches.
CPU Serialization: Host-side preprocessing or data loading blocking the GPU.

GPU Kernel Launch Overhead

The fixed latency cost of scheduling a computational kernel on the GPU. Profilers like NVIDIA Nsight Systems measure this directly. This overhead is critical for small operations and can be mitigated by:

Operator Fusion: Combining sequential ops into a single kernel.
Kernel Consolidation: Rewriting model code to use fewer, larger kernels.
Using CUDA Graphs: Capturing a sequence of kernels into a single, replayable graph to eliminate repeated launch overhead.
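
A back-of-the-envelope cost model shows why these mitigations help. The sketch below assumes the worst case, in which each launch's overhead is fully exposed rather than hidden behind the previous kernel's execution; the kernel counts and timings are hypothetical.

```python
def total_time_us(num_kernels: int, launch_overhead_us: float,
                  compute_us_total: float) -> float:
    """Worst-case model: every launch adds its full overhead to the timeline."""
    return num_kernels * launch_overhead_us + compute_us_total

LAUNCH_OVERHEAD_US = 5.0   # assumed fixed cost per launch
COMPUTE_US = 400.0         # assumed total compute, unchanged by fusion

unfused = total_time_us(200, LAUNCH_OVERHEAD_US, COMPUTE_US)  # 200 small kernels
fused = total_time_us(20, LAUNCH_OVERHEAD_US, COMPUTE_US)     # fused into 20 kernels

print(f"unfused: {unfused:.0f} us, overhead share {200 * LAUNCH_OVERHEAD_US / unfused:.0%}")
print(f"fused:   {fused:.0f} us, overhead share {20 * LAUNCH_OVERHEAD_US / fused:.0%}")
```

In practice kernel launches are asynchronous and partly overlap with execution, so the measured benefit should be confirmed with a profiler timeline rather than taken from this model.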

Flame Graphs

A visualization of profiling data that draws call stacks as stacked rectangles, with each rectangle's width proportional to the time or samples attributed to that function. Wide sections indicate functions consuming significant time. Essential for interpreting profiler output to:

Identify Hot Paths: See which functions dominate execution time.
Understand Call Hierarchies: Trace performance costs from high-level framework code down to low-level CUDA kernels.
Compare Profiles: Visually diff flame graphs before and after optimizations.

Operator Fusion

A compiler-level optimization whose opportunities are identified, and whose effect is validated, through profiling. It merges multiple sequential neural network operations into a single GPU kernel. Benefits include:

Reduced Kernel Launches: Lowers GPU kernel launch overhead.
Fewer Memory Accesses: Intermediate tensors are kept in registers/cache instead of written to global memory.
Framework Support: Automatically performed by TensorRT, ONNX Runtime, and PyTorch's torch.compile with the Inductor backend.

Model Execution Graph

An optimized, static representation of a neural network's computational operations produced by inference compilers. Profiling often targets the execution of this graph. Key aspects:

Static Optimization: Enables ahead-of-time operator fusion, kernel selection, and memory planning.
Profiler Integration: Tools like PyTorch Profiler can trace execution at the granularity of graph nodes.
Runtime Overhead Reduction: Eliminates interpreter overhead present in eager execution modes.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
