Glossary

FLOPs (Floating Point Operations)

FLOPs (Floating Point Operations) is a measure of the computational cost of a machine learning model, representing the total number of floating-point arithmetic operations required for a single forward pass.


What is FLOPs (Floating Point Operations)?

FLOPs (Floating Point Operations) is a fundamental metric for quantifying the computational cost of a machine learning model, representing the total number of floating-point arithmetic operations required for a single forward pass.

A FLOP counts one basic floating-point operation, such as an addition, subtraction, multiplication, or division. In deep learning, FLOPs (not to be confused with FLOPS, which denotes a rate in operations per second) aggregate the operations from matrix multiplications, convolutions, and activation functions within a neural network. This count serves as a hardware-agnostic measure of a model's intrinsic computational demand, distinct from its actual runtime, which depends on memory bandwidth and hardware optimization. It is a primary metric in model benchmarking suites for comparing architectural efficiency.

Engineers use FLOPs to estimate training costs, guide inference optimization, and design models for constrained environments like edge AI architectures. While crucial, FLOPs alone do not capture memory access costs or parallelizability, which are critical for real-world latency. Therefore, it is used alongside metrics like parameter count and actual latency benchmarking on target hardware to provide a complete performance profile during evaluation-driven development.

COMPUTATIONAL COST

Key Characteristics of FLOPs

FLOPs (Floating Point Operations) quantify the raw arithmetic workload of a neural network. Understanding its characteristics is essential for model design, hardware selection, and cost estimation.

Definition and Core Arithmetic

A FLOP is a single floating-point arithmetic operation, such as an addition, subtraction, multiplication, or division. In deep learning, the vast majority of FLOPs come from matrix multiplications and convolutions. For example, multiplying two n x n matrices requires approximately 2n³ FLOPs. This count excludes memory access operations, focusing purely on the computational kernel.
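The 2n³ figure can be checked with a short sketch. The function name `matmul_flops` is an illustrative assumption, and the count uses the common approximation of one multiply and one add per inner-product term.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """Approximate FLOPs for an (m x k) @ (k x n) matrix product.

    Each of the m * n outputs needs k multiplies and about k adds,
    which is where the factor of 2 comes from.
    """
    return 2 * m * k * n


# Square case: two n x n matrices give 2 * n**3.
n = 1024
print(matmul_flops(n, n, n))
```

For non-square shapes the same formula applies, which is why layer-level FLOP formulas for dense and convolutional layers also carry a leading factor of 2.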

Forward vs. Backward Pass

FLOP counts are typically reported for a single forward pass of one data sample through the model. The backward pass during training, which computes gradients via backpropagation, generally requires roughly twice the FLOPs of the forward pass. Therefore, a model with 100 GFLOPs for inference may require ~300 GFLOPs per sample for a full training step (forward + backward).
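The arithmetic behind the 100 GFLOPs to ~300 GFLOPs example can be written as a small helper. This is a rough sketch: the name `training_flops_per_sample` and the default `backward_ratio` of 2.0 are assumptions taken from the rule of thumb above, not exact values for any particular model.

```python
def training_flops_per_sample(forward_flops: float, backward_ratio: float = 2.0) -> float:
    """Forward + backward FLOPs for one sample.

    backward_ratio is the assumed cost of the backward pass relative to
    the forward pass (roughly 2x as a rule of thumb).
    """
    return forward_flops * (1.0 + backward_ratio)


forward = 100e9  # 100 GFLOPs per forward pass
print(training_flops_per_sample(forward) / 1e9, "GFLOPs per training sample")
```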

Hardware-Dependent Throughput

A FLOP count measures theoretical computational work, not actual runtime. Real-world speed depends on:

Hardware FLOP/s (FLOPS): The processor's maximum operations per second (e.g., an NVIDIA H100 SXM is rated at ~67 TFLOPS for FP32, and far higher for FP16 on Tensor Cores).
Memory Bandwidth: If operations are memory-bound, high peak FLOPS does not translate to speed.
Kernel Optimization: Efficiently mapped operations (e.g., using Tensor Cores) achieve a higher percentage of peak FLOPS.

Thus, a lower-FLOP model may run slower if it is poorly optimized for the hardware.
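One way to see why a lower-FLOP model can still be slower is a simplified roofline-style bound, where runtime is limited by whichever is larger: compute time or memory-transfer time. The sketch below is illustrative only; the hardware figures and both model profiles are hypothetical, and real kernels rarely reach either bound.

```python
def runtime_lower_bound_s(flops: float, bytes_moved: float,
                          peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Best-case runtime: the slower of the compute and memory phases."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / bandwidth_bytes_per_s
    return max(compute_s, memory_s)


# Hypothetical accelerator: 100 TFLOPS peak, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

# Hypothetical models: A has half the FLOPs of B but moves four times the data.
model_a = runtime_lower_bound_s(flops=50e9, bytes_moved=2e9,
                                peak_flops_per_s=PEAK, bandwidth_bytes_per_s=BANDWIDTH)
model_b = runtime_lower_bound_s(flops=100e9, bytes_moved=0.5e9,
                                peak_flops_per_s=PEAK, bandwidth_bytes_per_s=BANDWIDTH)

print(f"model A (50 GFLOPs):  {model_a * 1e3:.2f} ms")
print(f"model B (100 GFLOPs): {model_b * 1e3:.2f} ms")
```

Under these assumed numbers model A is memory-bound, so its lower FLOP count does not make it faster than model B.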

Proxy for Energy Consumption

FLOPs serve as a first-order proxy for energy cost and carbon footprint. Each floating-point operation consumes a measurable amount of energy on silicon, so training a large model such as GPT-3 (estimated at 3.14e23 FLOPs) corresponds to energy consumption measured in megawatt-hours. Reducing FLOPs is a primary lever for lowering the environmental impact and cloud compute bills of AI projects.

Limitations as a Performance Metric

FLOPs have significant limitations:

Ignores Activation Memory: Does not account for the cost of storing intermediate activations, which can be the true bottleneck.
No Parallelism Insight: A 100 GFLOP model with high parallelism may run faster than a 50 GFLOP sequential model.
Operation Mix: Treats all FLOPs equally, though hardware executes different operations at different speeds (e.g., fused multiply-adds are typically far cheaper than divisions or transcendental functions).

FLOPs are best used alongside metrics like parameter count, activation size, and measured latency.

Relation to Model Architecture

FLOPs are heavily influenced by architectural choices:

Dense Layers: FLOPs scale with the product of input and output neurons (approximately 2 * I * O).
Convolutional Layers: FLOPs ≈ 2 * H * W * C_in * K_h * K_w * C_out for output size H x W, kernel size K_h x K_w, and input/output channels C_in and C_out. Dropping the factor of 2 gives the count in multiply-accumulates (MACs).
Attention Layers: The self-attention mechanism in Transformers scales with O(n²d) for sequence length n and dimension d, making it a major FLOP contributor in LLMs.

Architects trade these components to balance FLOPs, accuracy, and model capacity.
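The O(n²d) scaling of self-attention can be illustrated with a rough per-layer count. This sketch assumes a single layer with total model dimension d (summed over heads), counts each multiply-add as two FLOPs, and ignores softmax, scaling, masking, and biases; the function name is hypothetical.

```python
def self_attention_flops(n: int, d: int) -> dict:
    """Approximate forward FLOPs for one self-attention layer."""
    projections = 4 * (2 * n * d * d)  # Q, K, V, and output projections: linear in n
    scores = 2 * n * n * d             # Q @ K^T: quadratic in n
    weighted_sum = 2 * n * n * d       # attention weights @ V: quadratic in n
    return {
        "projections": projections,
        "attention": scores + weighted_sum,
        "total": projections + scores + weighted_sum,
    }


shorter = self_attention_flops(n=1024, d=768)
longer = self_attention_flops(n=2048, d=768)

# Doubling the sequence length quadruples the n^2 * d term but only doubles the projections.
print(longer["attention"] / shorter["attention"])      # 4.0
print(longer["projections"] / shorter["projections"])  # 2.0
```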

COMPUTATIONAL COST

How FLOPs are Calculated

FLOPs (Floating Point Operations) is a fundamental metric for quantifying the computational cost of a machine learning model's forward pass, representing the total count of floating-point arithmetic operations required.

FLOPs are calculated by analyzing a model's architecture to count every floating-point addition, multiplication, or division performed. For a fully connected layer, this is approximately 2 * I * O FLOPs, where I is the input dimension and O is the output dimension. For a convolutional layer, the formula is 2 * K_h * K_w * C_in * C_out * H_out * W_out, accounting for kernel dimensions, input/output channels, and output feature map size. Matrix multiplications dominate the total, while activation functions like GeLU add a comparatively small number of operations.
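The two layer formulas above translate directly into code. The sketch below applies them to a toy two-layer network; the network shape and function names are hypothetical and chosen only for illustration, and biases and activations are ignored.

```python
def dense_flops(i: int, o: int) -> int:
    """Fully connected layer: 2 * I * O (one multiply and one add per weight)."""
    return 2 * i * o


def conv2d_flops(k_h: int, k_w: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    """Convolutional layer: 2 * K_h * K_w * C_in * C_out * H_out * W_out."""
    return 2 * k_h * k_w * c_in * c_out * h_out * w_out


# Toy network: 3x3 conv (3 -> 16 channels, 32x32 output), then a dense layer to 10 classes.
conv = conv2d_flops(k_h=3, k_w=3, c_in=3, c_out=16, h_out=32, w_out=32)
fc = dense_flops(i=16 * 32 * 32, o=10)
total = conv + fc

print(f"conv: {conv:,} FLOPs, fc: {fc:,} FLOPs, total: {total:,} FLOPs")
```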

This theoretical count provides a hardware-agnostic measure of computational complexity, distinct from practical runtime or FLOPS (Floating Point Operations Per Second). It is a critical input for inference optimization, hardware selection, and estimating a model's carbon footprint. While useful for comparing architectures, FLOPs do not capture memory bandwidth constraints, parallelization efficiency, or the impact of specialized operations like layer normalization, which is why they are often used alongside latency benchmarking in a comprehensive evaluation suite.

COMPARISON

FLOPs vs. Related Performance Metrics

This table distinguishes FLOPs, a measure of theoretical computational cost, from other key metrics used to evaluate AI model performance, efficiency, and real-world deployment characteristics.

Metric	FLOPs (Floating Point Operations)	Inference Latency	Memory Footprint	Throughput
Primary Definition	Total count of floating-point arithmetic operations (add, multiply) for a single forward pass.	Time delay (e.g., milliseconds) from input submission to model output.	Amount of memory (RAM/VRAM) required to load and run the model.	Number of inference requests processed per unit time (e.g., requests/sec).
What It Measures	Theoretical computational workload or algorithmic complexity.	End-to-end user-perceived speed or responsiveness.	Hardware resource requirement for model storage and execution.	System capacity or batch processing efficiency.
Unit of Measurement	FLOPs (e.g., GigaFLOPs, TeraFLOPs).	Time (e.g., milliseconds, seconds).	Bytes (e.g., Megabytes, Gigabytes).	Inferences per second (IPS).
Key Determinants	Model architecture (layers, parameters), input size.	Hardware (CPU/GPU/NPU), software optimizations, model size, input size.	Number of parameters, precision (FP32, FP16, INT8), optimizer states.	Hardware parallelism, batch size, software optimizations, latency.
Use Case in Evaluation	Estimating training cost, comparing theoretical efficiency of architectures.	Benchmarking real-time application feasibility (e.g., APIs, edge devices).	Assessing hardware compatibility and deployment feasibility.	Measuring server-side scaling and cost-per-inference at scale.
Directly Proportional to Model Size?
Indicates Real-World Speed?
Critical for Edge/On-Device AI?
Common Optimization Target For	Model architecture research, neural architecture search (NAS).	Inference engines, kernel fusion, quantization.	Model compression, pruning, quantization.	Batching strategies, concurrent request handling.

COMPUTATIONAL BUDGETING

Practical Applications of FLOPs Analysis

FLOPs analysis is a foundational engineering practice for quantifying and optimizing the computational cost of machine learning models. It directly informs hardware selection, architectural design, and deployment strategy.

Hardware Selection & Procurement

FLOPs are a starting point for sizing inference hardware. By calculating a model's FLOPs per forward pass, engineers can match it to hardware with sufficient FLOPS (Floating Point Operations Per Second) capacity to meet target latency. For example, a 100 GFLOP model requires at least 100 GFLOPS of sustained throughput to complete a forward pass in one second; because real utilization sits below peak, the rated capacity must be higher still. This analysis helps prevent costly over-provisioning and identifies bottlenecks, guiding decisions between GPUs, TPUs, and NPUs.
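The sizing rule in this paragraph can be sketched as a one-line calculation. The `utilization` parameter is an assumption added here to reflect that hardware rarely sustains its rated peak; the 40% figure is a hypothetical placeholder, not a measured value.

```python
def required_peak_flops_per_s(model_flops: float, target_latency_s: float,
                              utilization: float = 1.0) -> float:
    """Minimum rated FLOPS needed to finish one forward pass within the target latency.

    utilization is the assumed fraction of peak FLOPS actually sustained (0 < utilization <= 1).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return model_flops / target_latency_s / utilization


ideal = required_peak_flops_per_s(100e9, target_latency_s=1.0)
realistic = required_peak_flops_per_s(100e9, target_latency_s=1.0, utilization=0.4)

print(f"ideal: {ideal / 1e9:.0f} GFLOPS, at 40% utilization: {realistic / 1e9:.0f} GFLOPS")
```

This remains a compute-only bound; memory bandwidth can still dominate latency.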

Typical model range: GFLOPs to TFLOPs per forward pass.

Model Architecture Search (NAS)

FLOPs serve as a key constraint in automated Neural Architecture Search. The search algorithm explores a vast space of potential model designs (e.g., different layer depths, widths, kernel sizes) and uses FLOPs as a proxy for computational efficiency. This allows for the discovery of Pareto-optimal architectures that balance high accuracy with low computational cost, crucial for deploying models on edge devices or under strict inference budget constraints.

Cost Estimation for Training & Inference

FLOPs enable first-order forecasting of cloud compute expenses. The total FLOPs for a training run is calculated as:

Total Training FLOPs = (FLOPs per forward/backward pass per sample) * (Number of training samples) * (Number of epochs)

This figure, combined with the FLOPS/$ rate of cloud instances and a realistic utilization estimate, provides a cost estimate. For inference, FLOPs per query multiplied by expected query volume predicts sustained infrastructure costs, forming the basis for Service Level Objective (SLO) planning and Total Cost of Ownership (TCO) models.
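The training-cost formula can be turned into a rough estimator. Everything numeric below (dataset size, peak FLOPS, utilization, hourly price) is a hypothetical placeholder, and the function names are illustrative; the result is a planning estimate rather than a quote.

```python
def total_training_flops(step_flops_per_sample: float, num_samples: int, epochs: int) -> float:
    """Total Training FLOPs = per-sample forward/backward FLOPs * samples * epochs."""
    return step_flops_per_sample * num_samples * epochs


def estimated_cost_usd(total_flops: float, peak_flops_per_s: float,
                       utilization: float, price_per_device_hour: float) -> float:
    """Convert a FLOP budget into device-hours, then into dollars."""
    device_seconds = total_flops / (peak_flops_per_s * utilization)
    return device_seconds / 3600.0 * price_per_device_hour


# Hypothetical run: 300 GFLOPs per sample, 1M samples, 10 epochs.
budget = total_training_flops(300e9, num_samples=1_000_000, epochs=10)

# Hypothetical device: 100 TFLOPS peak, 40% sustained utilization, $2 per hour.
cost = estimated_cost_usd(budget, peak_flops_per_s=100e12,
                          utilization=0.4, price_per_device_hour=2.0)

print(f"{budget:.2e} FLOPs, about ${cost:.2f}")
```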

Comparative Model Benchmarking

FLOPs provide a standardized, hardware-agnostic basis for comparing the efficiency of different model architectures. When evaluating models with similar accuracy, the one with lower FLOPs is more computationally efficient on paper, although measured latency on target hardware can still differ. This is critical for leaderboards and research, moving beyond accuracy-only metrics to consider the computational cost of performance. For instance, comparing a Vision Transformer (ViT) to a ConvNeXt model requires analyzing both Top-1 accuracy and GFLOPs to understand the performance/cost trade-off.

Informing Model Compression Techniques

FLOPs analysis identifies the most computationally expensive layers (e.g., large fully-connected or attention layers), guiding targeted application of compression methods:

Pruning: Removing low-weight neurons/channels from high-FLOP layers yields the greatest reduction.
Quantization: Replacing 32-bit floating point (FP32) operations with 8-bit integer (INT8) operations directly reduces the cost of each operation, though the number of operations stays about the same.
Knowledge Distillation: Using FLOPs to size the student model appropriately relative to the teacher.

The goal is to maximize FLOPs reduction with minimal accuracy loss.
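Identifying the most expensive layers amounts to ranking a per-layer FLOP profile. The sketch below assumes such a profile is already available as a dictionary; the layer names and counts are hypothetical.

```python
def rank_layers_by_flops(layer_flops: dict) -> list:
    """Return (name, flops, share_of_total) tuples, most expensive layer first."""
    total = sum(layer_flops.values())
    ranked = sorted(layer_flops.items(), key=lambda item: item[1], reverse=True)
    return [(name, flops, flops / total) for name, flops in ranked]


# Hypothetical per-layer profile for a small model.
profile = {"conv1": 884_736, "fc": 327_680, "embed": 50_000}
ranking = rank_layers_by_flops(profile)

for name, flops, share in ranking:
    print(f"{name:>6}: {flops:>9,} FLOPs ({share:.1%})")
```

Layers at the top of the ranking are the natural first candidates for pruning or other compression.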

Environmental Impact & Carbon Accounting

FLOPs are a direct input for estimating the carbon footprint of AI. Energy consumption of hardware is strongly correlated with FLOPs executed. By tracking total FLOPs consumed during training (often reported in petaFLOP/s-days, or as raw totals in exaFLOPs and above), organizations can use region-specific carbon intensity of electricity (gCO2eq/kWh) to estimate emissions. This drives the development of green AI practices, incentivizing efficient architectures and scheduling training in regions with cleaner energy grids.
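A back-of-the-envelope version of this estimate converts FLOPs to runtime, runtime to energy, and energy to emissions. In the sketch below, one petaFLOP/s-day is taken as 10^15 operations per second sustained for one day; the hardware throughput, power draw, and grid intensity are hypothetical placeholders, and datacenter overhead (such as cooling) is ignored.

```python
SECONDS_PER_DAY = 86_400
JOULES_PER_KWH = 3.6e6


def flops_to_pfs_days(total_flops: float) -> float:
    """Express a FLOP total in petaFLOP/s-days (1e15 FLOP/s sustained for one day)."""
    return total_flops / (1e15 * SECONDS_PER_DAY)


def estimate_emissions_kg(total_flops: float, achieved_flops_per_s: float,
                          avg_power_watts: float, grid_gco2_per_kwh: float) -> float:
    """Estimate kg CO2eq; throughput and power must describe the same device or cluster."""
    runtime_s = total_flops / achieved_flops_per_s
    energy_kwh = runtime_s * avg_power_watts / JOULES_PER_KWH
    return energy_kwh * grid_gco2_per_kwh / 1000.0


# GPT-3's estimated training compute, expressed in petaFLOP/s-days.
print(f"{flops_to_pfs_days(3.14e23):.0f} petaFLOP/s-days")

# Hypothetical run: 1e21 FLOPs at 40 TFLOPS sustained, 700 W, 400 gCO2eq/kWh.
emissions = estimate_emissions_kg(1e21, achieved_flops_per_s=40e12,
                                  avg_power_watts=700.0, grid_gco2_per_kwh=400.0)
print(f"about {emissions:.0f} kg CO2eq")
```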

EXPLORE

FLOPs (FLOATING POINT OPERATIONS)

Frequently Asked Questions

FLOPs (Floating Point Operations) is a fundamental metric for quantifying the computational cost of machine learning models. This FAQ addresses common questions about its calculation, interpretation, and role in model benchmarking and deployment.

FLOPs (Floating Point Operations) are a count of the total number of floating-point arithmetic operations (primarily additions, multiplications, and divisions) required for a single forward pass of a neural network model. Calculation involves analyzing the model's architecture: for a fully connected layer, FLOPs are approximately 2 * I * O, where I is the input dimension and O is the output dimension (counting each multiply-add as two operations). For a convolutional layer with kernel size K, input channels C_in, output channels C_out, and output feature map size H_out * W_out, FLOPs are roughly 2 * K * K * C_in * C_out * H_out * W_out. Tools like ptflops for PyTorch or tf.profiler for TensorFlow automate this profiling. It is crucial to distinguish FLOPs from FLOPS (Floating Point Operations Per Second), which is a measure of hardware throughput.

Key Calculation Points:

Counts operations for one input sample (batch size 1).
Often reported as total FLOPs or GFLOPs (billions of FLOPs).
Excludes memory access costs. Many counts also omit activation functions (e.g., ReLU) and normalization layers, which contribute few FLOPs but still affect real runtime.


COMPUTATIONAL METRICS

Related Terms

FLOPs are a fundamental unit for measuring computational cost, but they exist within a broader ecosystem of metrics that define model efficiency, hardware performance, and operational economics.

Parameters

Parameters are the learnable weights and biases within a neural network that are adjusted during training. They represent the model's "memory" and directly influence its capacity and the number of FLOPs required for inference.

Relationship to FLOPs: The total parameter count and the architecture's connectivity pattern (e.g., dense vs. sparse) are primary determinants of the FLOPs calculation. More parameters generally, but not always, lead to higher FLOPs.
Example: A model with 7 billion parameters will have a fundamentally different FLOP profile than one with 70 million parameters, even for the same input size.

Inference Latency

Inference Latency is the real-world time delay, measured in milliseconds, between submitting an input to a trained model and receiving its output. While FLOPs measure theoretical computational cost, latency is the experienced performance.

Key Drivers: Latency is determined by FLOPs, memory bandwidth (how quickly weights can be accessed), hardware parallelism (e.g., GPU cores), and software optimization.
Critical Metric: For production systems, P95 or P99 latency percentiles are often more critical service-level objectives (SLOs) than raw FLOPs, as they directly impact user experience.

Throughput

Throughput measures the number of inferences a system can process per second (e.g., inferences/sec). For a single request stream it is the inverse of latency; with batching or concurrent execution it is roughly the batch size divided by the batch latency. It is crucial for high-volume serving.

FLOPs Efficiency: High-throughput systems aim to maximize the utilization of available FLOP/s (FLOPS) on the hardware. Techniques like continuous batching dynamically group requests to keep computational units saturated.
Economic Impact: Throughput directly correlates with the cost-per-inference, making it a key metric for infrastructure cost control alongside FLOPs.
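A simple way to relate throughput to latency is to divide the number of requests processed together by the time the batch takes. The sketch below uses hypothetical latency figures; for a batch size of 1 it reduces to the inverse of latency.

```python
def throughput_per_s(batch_size: int, batch_latency_s: float) -> float:
    """Inferences per second when batch_size requests finish in batch_latency_s."""
    return batch_size / batch_latency_s


# Hypothetical measurements: batching raises per-batch latency but improves throughput.
single = throughput_per_s(batch_size=1, batch_latency_s=0.020)
batched = throughput_per_s(batch_size=16, batch_latency_s=0.080)

print(f"batch 1: {single:.0f} inf/s, batch 16: {batched:.0f} inf/s")
```

The trade-off is visible in the assumed numbers: each request in the batch of 16 waits 80 ms instead of 20 ms, while the system serves four times as many requests per second.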

MACs (Multiply-Accumulate Operations)

A Multiply-Accumulate (MAC) operation is a fundamental compute primitive that performs a = a + (b * c). It is the core operation in matrix multiplications and convolutions.

Relationship to FLOPs: One MAC is often counted as two FLOPs (one multiplication and one addition). However, some literature uses MACs and FLOPs interchangeably, so context is essential.
Hardware Design: Modern AI accelerators (TPUs, NPUs) are often benchmarked on their peak MAC/s or TOPS (Tera Operations Per Second) capacity, which aligns directly with FLOP-based model analysis.
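Because sources differ on whether they report MACs or FLOPs, it helps to make the conversion explicit. The helper names below are illustrative, and the factor of 2 follows the convention stated above (one multiplication plus one addition per MAC).

```python
FLOPS_PER_MAC = 2  # one multiply + one add, per the convention described above


def macs_to_flops(macs: float) -> float:
    """Convert a multiply-accumulate count to FLOPs."""
    return macs * FLOPS_PER_MAC


def flops_to_macs(flops: float) -> float:
    """Convert a FLOP count back to multiply-accumulates."""
    return flops / FLOPS_PER_MAC


# A figure reported as "4.1 G" means different workloads depending on the unit.
reported = 4.1e9
print(f"if MACs:  {macs_to_flops(reported) / 1e9:.1f} GFLOPs")
print(f"if FLOPs: {flops_to_macs(reported) / 1e9:.2f} GMACs")
```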

Model Compression

Model Compression is a suite of techniques aimed at reducing a model's computational footprint (FLOPs) and memory requirements (parameters) for efficient deployment.

Key Techniques:
- Pruning: Removing insignificant weights or neurons, reducing active FLOPs.
- Quantization: Representing weights and activations with lower precision (e.g., from 32-bit to 8-bit), which reduces memory bandwidth and can enable faster low-precision FLOPs on supported hardware.
- Knowledge Distillation: Training a smaller "student" model to mimic a larger "teacher," achieving lower FLOPs with minimal performance loss.

FLOPS (Floating Point Operations Per Second)

FLOPS (Floating Point Operations Per Second) is a measure of a hardware system's computational throughput, not to be confused with FLOPs (the count of operations).

Hardware Benchmark: It quantifies the peak theoretical performance of a CPU, GPU, or AI accelerator (e.g., 100 TFLOPS = 100 trillion FLOPs per second).
Practical Performance: The ratio of achieved FLOPS to peak FLOPS when running a model is its compute utilization. A model requiring 1 TFLOP (the count) that runs in 0.02 seconds on a 100 TFLOPS chip achieves 50 TFLOPS, or 50% utilization.
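The utilization example above can be reproduced in a few lines. The function names are illustrative assumptions; the inputs are the figures from the example.

```python
def achieved_flops_per_s(model_flops: float, runtime_s: float) -> float:
    """Sustained throughput: operations executed divided by wall-clock time."""
    return model_flops / runtime_s


def compute_utilization(model_flops: float, runtime_s: float, peak_flops_per_s: float) -> float:
    """Fraction of the hardware's peak FLOPS actually achieved."""
    return achieved_flops_per_s(model_flops, runtime_s) / peak_flops_per_s


achieved = achieved_flops_per_s(1e12, runtime_s=0.02)
utilization = compute_utilization(1e12, runtime_s=0.02, peak_flops_per_s=100e12)

print(f"{achieved / 1e12:.0f} TFLOPS achieved, {utilization:.0%} utilization")
```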

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
