Glossary

MLPerf Tiny

MLPerf Tiny is a standardized benchmark suite from MLCommons, the consortium behind MLPerf, designed to measure the performance and accuracy of machine learning systems on ultra-low-power microcontrollers and edge devices.

TINYML BENCHMARK

What is MLPerf Tiny?

MLPerf Tiny is the definitive, vendor-neutral benchmark suite for evaluating the performance and accuracy of machine learning systems on ultra-low-power microcontrollers and other deeply embedded devices.

MLPerf Tiny is a specialized benchmark suite from MLCommons designed to provide standardized, reproducible metrics for TinyML systems. It measures inference latency, energy consumption, and model accuracy across a set of common embedded tasks, such as keyword spotting and visual wake words. This allows engineers to compare hardware platforms, software frameworks, and neural network architectures objectively under identical conditions.

The benchmark focuses on microcontroller-class devices with severe constraints on memory, compute, and power. By establishing a common evaluation ground, MLPerf Tiny drives innovation in model efficiency, hardware-aware optimization, and inference engine design. It is a critical tool for embedded developers and silicon vendors to validate and demonstrate the real-world capabilities of their TinyML solutions for applications in IoT, wearables, and smart sensors.

BENCHMARK SUITE

Key Characteristics of MLPerf Tiny

MLPerf Tiny is a standardized benchmark suite from MLCommons designed to measure and compare the performance, accuracy, and efficiency of machine learning inference on ultra-low-power microcontrollers (MCUs).

Focus on Microcontroller-Class Devices

MLPerf Tiny is explicitly designed for microcontroller units (MCUs) and other deeply embedded processors, typically characterized by:

Severe memory constraints (often < 1 MB of SRAM/Flash)
Extremely low power budgets (milliwatt-scale operation)
Limited compute (single-core Arm Cortex-M class CPUs, often without an FPU)
No operating system, or only a minimal RTOS

This distinguishes it from other MLPerf benchmarks (such as Mobile or Datacenter), which target smartphones, laptops, or servers with orders of magnitude more resources.

Standardized Benchmark Tasks

The suite comprises a small set of representative TinyML tasks chosen for their real-world relevance and diversity of computational patterns. As of v1.1, the benchmarks are:

Keyword Spotting (KWS): Identify spoken commands from audio.
Visual Wake Words (VWW): Detect the presence of a person in an image.
Image Classification (IC): Classify images from the CIFAR-10 dataset.
Anomaly Detection (AD): Identify anomalous machine sounds from audio.

Each task provides a standardized dataset, a reference model, and a precise accuracy target that must be met for a valid submission, ensuring fair comparison.
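The idea that a result only counts once it meets the task's quality target can be sketched as a small lookup. This is an illustration, not the official harness: the `Task` type, the `TASKS` registry, and `is_valid_result` are hypothetical names, and the numeric targets are assumed values that should be checked against the published MLPerf Tiny rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    target: float  # minimum quality score; assumed value, verify against the official rules


# Hypothetical registry of the four tasks named above.
TASKS = {
    "KWS": Task("Keyword Spotting", 0.90),
    "VWW": Task("Visual Wake Words", 0.80),
    "IC": Task("Image Classification", 0.85),
    "AD": Task("Anomaly Detection", 0.85),
}


def is_valid_result(task_id: str, measured_quality: float) -> bool:
    """Return True only if the measured quality meets the task's minimum target."""
    return measured_quality >= TASKS[task_id].target


if __name__ == "__main__":
    print(is_valid_result("KWS", 0.91))  # meets the assumed KWS target
    print(is_valid_result("IC", 0.80))   # falls short of the assumed IC target
```

Keeping the target next to the task definition makes the rule explicit: speed or energy numbers are only comparable between systems that clear the same quality bar.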

Multi-Dimensional Metrics

Performance is measured across several critical axes for embedded systems, not just raw speed:

Latency: The time to perform a single inference, measured in milliseconds.
Throughput: The number of inferences processed per second.
Energy: The total joules consumed per inference, a key metric for battery-powered devices.
Peak Memory Usage: The maximum SRAM (temporary) and Flash (persistent) memory consumed by the model and runtime.
Accuracy: The model's task performance, which must meet the benchmark's minimum threshold.

Results are published in a table that lets engineers trade off these dimensions based on their application's needs (e.g., lowest energy vs. highest accuracy).
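The relationship between latency, throughput, and energy can be made concrete with a little arithmetic. The sketch below assumes a run of back-to-back single inferences at a known average power; `summarize_run` and its parameter names are hypothetical and do not correspond to any official measurement tool.

```python
def summarize_run(total_time_s: float, num_inferences: int, avg_power_mw: float) -> dict:
    """Derive per-inference metrics from one timed run at a known average power."""
    if total_time_s <= 0 or num_inferences <= 0:
        raise ValueError("total_time_s and num_inferences must be positive")
    latency_ms = total_time_s / num_inferences * 1000.0
    throughput_ips = num_inferences / total_time_s
    # Energy = power x time. With power in mW and time in ms, the product is in microjoules.
    energy_uj = avg_power_mw * latency_ms
    return {
        "latency_ms": latency_ms,
        "throughput_ips": throughput_ips,
        "energy_uj": energy_uj,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 100 inferences in 2 seconds at an average of 5 mW.
    print(summarize_run(total_time_s=2.0, num_inferences=100, avg_power_mw=5.0))
```

Note that a faster device is not automatically a more efficient one: halving latency while tripling power draw increases the energy per inference.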

Strict Submission Rules & Auditing

To ensure credibility and prevent unfair optimization, MLPerf Tiny enforces rigorous submission rules:

Closed Division: Submissions must use the benchmark's official datasets and models; no architectural changes or extra training data are allowed. This tests deployment efficiency.
Open Division: Allows model architecture changes and retraining, fostering innovation in model design for constrained hardware.
Required Measurements: Latency and energy must be measured on physical hardware, not simulated.
Auditability: All submissions include detailed configuration files, code, and measurement methodologies that are reviewed by the MLPerf organization.

Hardware and Software Agnosticism

The benchmark is platform-agnostic, enabling fair competition across diverse hardware and software stacks:

Hardware: Supports any MCU, SoC, or accelerator (e.g., Arm Cortex-M, RISC-V, Ethos-U55 NPU).
Frameworks: Compatible with any TinyML inference engine (e.g., TensorFlow Lite Micro, CMSIS-NN, proprietary vendor SDKs).
Reference Implementations: Provides a baseline implementation using TensorFlow Lite for Microcontrollers to lower the entry barrier.

This agnosticism drives innovation across the entire TinyML ecosystem, from silicon vendors to compiler developers.

Driving Ecosystem Development

Beyond mere measurement, MLPerf Tiny serves as a catalyst and reference point for the TinyML industry:

Vendor Benchmarking: Chipmakers (ST, NXP, Renesas, etc.) and IP providers (Arm) use it to showcase hardware capabilities.
Toolchain Validation: Framework developers (TF Lite Micro, TVM) use it to verify optimization passes and compiler correctness.
Research Benchmark: Academics and researchers use it as a standard testbed for new model compression, neural architecture search (NAS), and efficient kernel techniques.
Purchasing Guidance: Provides CTOs and engineers with objective, audited data for hardware and software selection.

TINYML FRAMEWORKS

How MLPerf Tiny Benchmarking Works

MLPerf Tiny is the definitive benchmark suite for evaluating machine learning performance on microcontrollers and other ultra-low-power devices.

MLPerf Tiny is a standardized benchmark suite from MLCommons designed to measure the inference latency, accuracy, and energy efficiency of machine learning systems on microcontrollers. It provides a rigorous, vendor-neutral methodology for comparing TinyML frameworks, hardware accelerators, and model optimizations across four representative tasks: keyword spotting, visual wake words, image classification, and anomaly detection.

The benchmark enforces strict submission rules requiring results from physical hardware, not simulation, ensuring real-world relevance. It reports energy consumption in microjoules per inference alongside peak memory usage, both critical constraints for battery-operated devices. By providing these standardized metrics, MLPerf Tiny drives innovation in model compression, neural architecture search, and efficient kernel libraries for the embedded AI ecosystem.

BENCHMARK SUITE

MLPerf Tiny Benchmark Tasks

MLPerf Tiny is a standardized benchmark suite designed to measure the performance and accuracy of machine learning systems on ultra-low-power microcontrollers. Its tasks represent real-world, compute-intensive workloads for embedded AI.

Keyword Spotting (KWS)

The Keyword Spotting task involves detecting specific spoken words (e.g., 'yes', 'no', 'up', 'down') from a continuous audio stream. This is a foundational capability for voice-controlled devices.

Dataset: Uses the Google Speech Commands v2 dataset.
Model: Typically a Depthwise Separable Convolutional Neural Network (DS-CNN).
Challenge: Requires real-time processing of audio frames with low latency and high accuracy, despite background noise and varying speakers.
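To see why depthwise separable convolutions suit this kind of workload, it helps to count weights. The sketch below compares one standard convolution layer with its depthwise separable equivalent; the function names are hypothetical, biases are ignored, and the layer sizes are illustrative rather than taken from the reference model.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def ds_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    print(standard_conv_params(3, 64, 64))  # 36864
    print(ds_conv_params(3, 64, 64))        # 4672
```

For this illustrative 3x3 layer with 64 input and 64 output channels, the separable form needs roughly one eighth of the weights, which translates directly into less Flash and fewer multiply-accumulate operations per inference.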

Visual Wake Words (VWW)

The Visual Wake Words task is a binary image classification problem: determining whether a person is present in a low-resolution image. It's critical for privacy-preserving, always-on camera applications.

Dataset: Derived from the COCO dataset, with images resized to 96x96 pixel RGB.
Model: Often a MobileNetV1 or similar efficient convolutional architecture.
Challenge: Maximizing accuracy within a severe sub-250KB model memory budget, balancing false positives and negatives.
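A quick way to reason about a budget like this is to estimate weight storage from the parameter count and numeric precision. The sketch below is a rough estimate under stated assumptions: it treats 1 KB as 1024 bytes, counts weights only (no activations, runtime, or metadata), and uses hypothetical function names and an illustrative parameter count.

```python
def model_size_bytes(num_params: int, bits_per_param: int = 8) -> int:
    """Bytes needed to store the weights alone, rounded up to whole bytes."""
    return (num_params * bits_per_param + 7) // 8


def fits_budget(num_params: int, bits_per_param: int = 8, budget_kb: int = 250) -> bool:
    """Check weight storage against a budget, assuming 1 KB = 1024 bytes."""
    return model_size_bytes(num_params, bits_per_param) <= budget_kb * 1024


if __name__ == "__main__":
    # Illustrative model with 200,000 parameters.
    print(fits_budget(200_000, bits_per_param=8))   # True: 200,000 bytes
    print(fits_budget(200_000, bits_per_param=32))  # False: 800,000 bytes
```

The same network that fits comfortably with 8-bit weights overshoots the budget several times over in 32-bit floating point, which is why quantization is usually the first optimization applied.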

Image Classification (IC)

The Image Classification task requires a model to classify an image into one of ten categories, representing a more computationally intensive vision workload than VWW.

Dataset: Uses the CIFAR-10 dataset (32x32 pixel images across 10 classes).
Model: Variants of ResNet or MobileNet are common.
Challenge: Achieving high classification accuracy on a complex dataset while operating within the strict memory and compute limits of a microcontroller, often requiring aggressive quantization.
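The quantization mentioned above typically maps floating-point values onto 8-bit integers with a scale and a zero point. The following is a generic sketch of affine int8 quantization in plain Python, not the scheme used by any particular framework or submission; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float, int]:
    """Affine-quantize floats to int8. Returns (quantized values, scale, zero_point)."""
    # The representable range is widened to include zero so that 0.0 maps exactly.
    lo = min(values + [0.0])
    hi = max(values + [0.0])
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 when every value is zero
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q: list[int], scale: float, zero_point: int) -> list[float]:
    """Map int8 values back to approximate floats."""
    return [(x - zero_point) * scale for x in q]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.5, 1.0]
    q, scale, zp = quantize_int8(weights)
    print(q, scale, zp)
    print(dequantize(q, scale, zp))
```

Each value now occupies one byte instead of four, at the cost of a rounding error on the order of one quantization step (`scale`). Whether that error is acceptable is exactly what the benchmark's accuracy target checks.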

Anomaly Detection (AD)

The Anomaly Detection task involves identifying irregular patterns in sensor data, which is essential for predictive maintenance in industrial IoT.

Dataset: Uses the ToyADMOS dataset, containing normal and anomalous machine audio samples.
Model: Often an Autoencoder that learns to reconstruct normal data; high reconstruction error indicates an anomaly.
Challenge: Operating on raw, 1D audio data with a model small enough for an MCU, and defining a robust threshold for anomaly scoring in unpredictable environments.
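The scoring logic behind an autoencoder-based detector can be sketched in a few lines. The example below assumes the reconstruction is already available and uses mean squared error as the anomaly score; the mean-plus-k-standard-deviations threshold is one common heuristic chosen here for illustration, not a rule prescribed by the benchmark, and all function names are hypothetical.

```python
def reconstruction_error(x: list[float], x_hat: list[float]) -> float:
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def fit_threshold(normal_errors: list[float], k: float = 3.0) -> float:
    """Heuristic threshold: mean + k standard deviations of errors on normal data."""
    mean = sum(normal_errors) / len(normal_errors)
    variance = sum((e - mean) ** 2 for e in normal_errors) / len(normal_errors)
    return mean + k * variance ** 0.5


def is_anomaly(x: list[float], x_hat: list[float], threshold: float) -> bool:
    """Flag an input whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, x_hat) > threshold


if __name__ == "__main__":
    # Illustrative errors observed on known-normal samples.
    threshold = fit_threshold([0.01, 0.02, 0.015, 0.012])
    print(is_anomaly([1.0, 1.0], [0.0, 0.0], threshold))  # poorly reconstructed input
    print(is_anomaly([1.0, 1.0], [1.0, 1.0], threshold))  # perfectly reconstructed input
```

Because the threshold is fitted only on normal data, its robustness depends on how well those samples cover real operating conditions, which is the difficulty the challenge above describes.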

Benchmarking Metrics

MLPerf Tiny measures systems across multiple, equally important axes to provide a holistic view of TinyML performance.

Accuracy: Primary quality metric (e.g., top-1 accuracy for IC, area under the ROC curve (AUC) for AD). The benchmark defines a minimum target for each task.
Latency: Time to perform a single inference, critical for real-time responsiveness.
Energy: Total joules consumed per inference, measured directly on the hardware under test.
Memory Footprint: Model size (ROM) and peak RAM usage for activations. These hard constraints define what is deployable.

Reference Implementations & Divisions

To ensure fair and reproducible comparisons, MLPerf Tiny provides open-source reference models and defines submission categories.

Reference Models: Fully functional, pretrained models for each task, establishing a performance baseline.
Closed Division: Submitters must use the benchmark's reference model and dataset, focusing optimization on the inference engine and system software.
Open Division: Allows modifications to the model architecture (e.g., neural architecture search) and training process, encouraging algorithmic innovation.
Available Submissions: All results are publicly published, fostering transparency and driving progress in the field.

COMPARISON

MLPerf Tiny vs. Other ML Benchmarks

This table contrasts the focus, scope, and technical characteristics of MLPerf Tiny against other prominent machine learning benchmark suites.

Feature / Metric	MLPerf Tiny	MLPerf Inference (Datacenter/Edge)	EEMBC MLMark	AI-Benchmark (Lite)
Primary Target Hardware	Microcontrollers (MCUs)	Servers, Edge AI accelerators, High-end SoCs	Microcontrollers & Low-power SoCs	Mobile & Embedded SoCs (Android)
Typical Power Envelope	< 50 mW	> 10 W	< 1 W	1-5 W
Memory Constraint Focus	SRAM (< 512 KB)	DRAM (GBs)	SRAM/Flash (KB-MB)	RAM (GBs)
Benchmark Suite Scope	Closed & Open Divisions, Prescribed Reference Models	Closed & Open Divisions, Multiple Scenarios	Closed-Division, Prescribed Models	Closed-Division, Prescribed Models
Key Measured Metrics	Accuracy, Latency, Energy	Throughput, Latency, Accuracy	Inference Time, Energy	Inference Time, Accuracy
Standardized Workloads	Keyword Spotting, Visual Wake Words, Image Classification, Anomaly Detection	Image Classification, Object Detection, NLP, Recommendation	Image Classification, Keyword Spotting	Image Classification, NLP, Face Recognition
Submission Requirements	Full system reproducibility (code, build, run)	Detailed system description, reproducible results	Results submission to EEMBC portal	Mobile app execution & score upload
Industry Consortium Backing	MLPerf (MLCommons)	MLPerf (MLCommons)	EEMBC	Independent (ETH Zurich)
Primary Audience	MCU Vendors, Embedded ML Researchers	Cloud/Edge HW Vendors, Datacenter Operators	MCU/Silicon Vendors, OEMs	Mobile SoC Vendors, App Developers

MLPERF TINY

Frequently Asked Questions

These FAQs address the purpose, structure, and role of MLPerf Tiny in the TinyML ecosystem.

MLPerf Tiny is a benchmark suite from MLCommons designed to measure the performance and accuracy of machine learning inference systems on ultra-low-power devices like microcontrollers. It provides standardized, reproducible metrics for comparing TinyML solutions across different hardware, software, and model optimizations. The benchmark focuses on four key tasks representative of real-world edge applications: Keyword Spotting (KWS), Visual Wake Words (VWW), Image Classification (IC), and Anomaly Detection (AD). Each task is defined by a reference model, dataset, and quality target, ensuring fair comparisons. Submissions are measured on metrics including latency, energy consumption, and model accuracy, providing a holistic view of system efficiency for developers and hardware vendors.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
