MLPerf Tiny is a specialized benchmark suite from MLCommons, the consortium behind MLPerf, designed to provide standardized, reproducible metrics for TinyML systems. It measures key performance indicators like inference latency, energy consumption, and model accuracy across a set of common embedded tasks, such as keyword spotting and visual wake words. This allows engineers to make objective comparisons between different hardware platforms, software frameworks, and neural network architectures under identical conditions.
MLPerf Tiny

What is MLPerf Tiny?
MLPerf Tiny is the definitive, vendor-neutral benchmark suite for evaluating the performance and accuracy of machine learning systems on ultra-low-power microcontrollers and other deeply embedded devices.
The benchmark focuses on microcontroller-class devices with severe constraints on memory, compute, and power. By establishing a common evaluation ground, MLPerf Tiny drives innovation in model efficiency, hardware-aware optimization, and inference engine design. It is a critical tool for embedded developers and silicon vendors to validate and demonstrate the real-world capabilities of their TinyML solutions for applications in IoT, wearables, and smart sensors.
Key Characteristics of MLPerf Tiny
MLPerf Tiny is a standardized benchmark suite from the MLPerf consortium designed to measure and compare the performance, accuracy, and efficiency of machine learning inference on ultra-low-power microcontrollers (MCUs).
Focus on Microcontroller-Class Devices
MLPerf Tiny is explicitly designed for microcontroller units (MCUs) and other deeply embedded processors, typically characterized by:
- Severe memory constraints (often < 1 MB of SRAM/Flash)
- Extremely low power budgets (milliwatt-scale operation)
- Limited compute (single-core Arm Cortex-M class CPUs, often without an FPU)
- No OS, or only a minimal RTOS

This distinguishes it from other MLPerf benchmarks (like Mobile or Datacenter), which target smartphones, laptops, or servers with orders of magnitude more resources.
Standardized Benchmark Tasks
The suite comprises a small set of representative TinyML tasks chosen for their real-world relevance and diversity of computational patterns. As of v1.1, the benchmarks are:
- Keyword Spotting (KWS): Identify spoken commands from audio.
- Visual Wake Words (VWW): Detect the presence of a person in an image.
- Image Classification (IC): Classify images from the CIFAR-10 dataset.
- Anomaly Detection (AD): Identify anomalous machine sounds from audio.

Each task provides a standardized dataset, a reference model, and a precise accuracy target that must be met for a valid submission, ensuring fair comparison.
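The idea of a per-task quality gate can be sketched in a few lines of Python. The thresholds below are the values commonly cited for the closed division and should be treated as assumptions to confirm against the official rules; the `QUALITY_TARGETS` table and `meets_target` helper are hypothetical names, not part of any MLPerf tooling.

```python
# Minimum quality per task. These numbers are assumptions taken from commonly
# cited closed-division targets; verify them against the official rules.
QUALITY_TARGETS = {
    "KWS": ("top-1 accuracy", 0.90),
    "VWW": ("top-1 accuracy", 0.80),
    "IC": ("top-1 accuracy", 0.85),
    "AD": ("AUC", 0.85),
}


def meets_target(task, measured):
    """Return True if the measured quality reaches the task's minimum."""
    _metric, minimum = QUALITY_TARGETS[task]
    return measured >= minimum


if __name__ == "__main__":
    for task, value in [("KWS", 0.912), ("AD", 0.84)]:
        metric, minimum = QUALITY_TARGETS[task]
        status = "valid" if meets_target(task, value) else "below target"
        print(f"{task}: {metric} {value:.3f} (minimum {minimum:.2f}) -> {status}")
```

A result that misses its threshold is not merely ranked lower; it does not count as a valid submission for that task.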
Multi-Dimensional Metrics
Performance is measured across several critical axes for embedded systems, not just raw speed:
- Latency: The time to perform a single inference, measured in milliseconds.
- Throughput: The number of inferences processed per second.
- Energy: The total joules consumed per inference, a key metric for battery-powered devices.
- Peak Memory Usage: The maximum SRAM (temporary) and Flash (persistent) memory consumed by the model and runtime.
- Accuracy: The model's task performance, which must meet the benchmark's minimum threshold.

Results are presented in a results table that allows engineers to trade off these dimensions based on their application's needs (e.g., lowest energy vs. highest accuracy).
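As a rough illustration of how these numbers relate, the sketch below derives latency, throughput, and energy per inference from raw measurements. It assumes you already have per-inference timings and a mean power reading from an external monitor; the `summarize` function is a hypothetical helper and does not reproduce the official measurement procedure.

```python
from statistics import median


def summarize(latencies_ms, avg_power_mw):
    """Derive headline metrics from per-inference timings and mean power.

    latencies_ms: single-inference times in milliseconds.
    avg_power_mw: mean power drawn while inferring, in milliwatts.
    """
    latency_ms = median(latencies_ms)
    throughput_ips = 1000.0 / latency_ms  # inferences per second
    # mW * ms = microjoules, so no extra scaling factor is needed.
    energy_uj = avg_power_mw * latency_ms
    return {
        "latency_ms": latency_ms,
        "throughput_ips": throughput_ips,
        "energy_uj": energy_uj,
    }


if __name__ == "__main__":
    print(summarize([20.0, 21.0, 19.0], avg_power_mw=5.0))
```

Because energy is the product of power and time, a faster device at higher power can still use less energy per inference than a slower, lower-power one.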
Strict Submission Rules & Auditing
To ensure credibility and prevent unfair optimization, MLPerf Tiny enforces rigorous submission rules:
- Closed Division: Submissions must use the benchmark's official datasets and models; no architectural changes or extra training data are allowed. This tests deployment efficiency.
- Open Division: Allows model architecture changes and retraining, fostering innovation in model design for constrained hardware.
- Required Measurements: Latency and energy must be measured on physical hardware, not simulated.
- Auditability: All submissions include detailed configuration files, code, and measurement methodologies that are reviewed by the MLPerf organization.
Hardware and Software Agnosticism
The benchmark is platform-agnostic, enabling fair competition across diverse hardware and software stacks:
- Hardware: Supports any MCU, SoC, or accelerator (e.g., Arm Cortex-M, RISC-V, Ethos-U55 NPU).
- Frameworks: Compatible with any TinyML inference engine (e.g., TensorFlow Lite Micro, CMSIS-NN, proprietary vendor SDKs).
- Reference Implementations: Provides a baseline implementation using TensorFlow Lite for Microcontrollers to lower the entry barrier.

This agnosticism drives innovation across the entire TinyML ecosystem, from silicon vendors to compiler developers.
Driving Ecosystem Development
Beyond mere measurement, MLPerf Tiny serves as a catalyst and reference point for the TinyML industry:
- Vendor Benchmarking: Chipmakers (ST, NXP, Renesas, etc.) and IP providers (Arm) use it to showcase hardware capabilities.
- Toolchain Validation: Framework developers (TF Lite Micro, TVM) use it to verify optimization passes and compiler correctness.
- Research Benchmark: Academics and researchers use it as a standard testbed for new model compression, neural architecture search (NAS), and efficient kernel techniques.
- Purchasing Guidance: Provides CTOs and engineers with objective, audited data for hardware and software selection.
How MLPerf Tiny Benchmarking Works
MLPerf Tiny is the definitive benchmark suite for evaluating machine learning performance on microcontrollers and other ultra-low-power devices.
MLPerf Tiny is a standardized benchmark suite from the MLPerf consortium designed to measure the inference latency, accuracy, and energy efficiency of machine learning systems on microcontrollers. It provides a rigorous, vendor-neutral methodology for comparing TinyML frameworks, hardware accelerators, and model optimizations across four representative tasks: keyword spotting, visual wake words, image classification, and anomaly detection.
The benchmark enforces strict submission rules requiring results from physical hardware, not simulation, ensuring real-world relevance. It measures energy consumption in microjoules per inference and peak memory usage, which are critical constraints for battery-operated devices. By providing these standardized metrics, MLPerf Tiny drives innovation in model compression, neural architecture search, and efficient kernel libraries for the embedded AI ecosystem.
MLPerf Tiny Benchmark Tasks
MLPerf Tiny is a standardized benchmark suite designed to measure the performance and accuracy of machine learning systems on ultra-low-power microcontrollers. Its tasks represent real-world, compute-intensive workloads for embedded AI.
Benchmarking Metrics
MLPerf Tiny measures systems across multiple, equally important axes to provide a holistic view of TinyML performance.
- Accuracy: Primary metric (e.g., top-1 accuracy for IC, AUC for AD). The benchmark defines minimum accuracy targets.
- Latency: Time to perform a single inference, critical for real-time responsiveness.
- Energy: Total joules consumed per inference, measured directly on the hardware under test.
- Memory Footprint: Model size (ROM) and peak RAM usage for activations. These hard constraints define what is deployable.
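To make the memory footprint item concrete, here is a back-of-the-envelope estimate for an int8 model with a purely sequential layer graph. It is a sketch under simplifying assumptions: it ignores runtime overhead, scratch buffers, and branching graphs, and both helper names are hypothetical.

```python
def flash_bytes(param_count, bytes_per_weight=1):
    """Approximate model size in ROM/Flash; int8 weights take one byte each."""
    return param_count * bytes_per_weight


def peak_activation_bytes(tensor_sizes):
    """Estimate peak RAM for activations in a purely sequential network.

    tensor_sizes lists the byte size of each tensor in order, from the model
    input to the model output. While a layer runs, its input and output
    buffers are both live, so the peak is the largest adjacent pair.
    """
    if len(tensor_sizes) < 2:
        raise ValueError("need at least an input and an output tensor")
    return max(a + b for a, b in zip(tensor_sizes, tensor_sizes[1:]))


if __name__ == "__main__":
    sizes = [3072, 16384, 8192, 10]  # hypothetical per-tensor sizes in bytes
    print("Flash estimate:", flash_bytes(52_000), "bytes")
    print("Peak activation RAM:", peak_activation_bytes(sizes), "bytes")
```

The peak is set by the widest pair of adjacent layers, not by the total parameter count, which is why a small model can still fail to fit in SRAM.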
MLPerf Tiny vs. Other ML Benchmarks
This table contrasts the focus, scope, and technical characteristics of MLPerf Tiny against other prominent machine learning benchmark suites.
| Feature / Metric | MLPerf Tiny | MLPerf Inference (Datacenter/Edge) | EEMBC MLMark | AI-Benchmark (Lite) |
|---|---|---|---|---|
| Primary Target Hardware | Microcontrollers (MCUs) | Servers, Edge AI accelerators, High-end SoCs | Microcontrollers & Low-power SoCs | Mobile & Embedded SoCs (Android) |
| Typical Power Envelope | < 50 mW | | < 1 W | 1-5 W |
| Memory Constraint Focus | SRAM (< 512 KB) | DRAM (GBs) | SRAM/Flash (KB-MB) | RAM (GBs) |
| Benchmark Suite Scope | Closed & Open Divisions, Reference Models | Closed & Open Divisions, Multiple Scenarios | Closed-Division, Prescribed Models | Closed-Division, Prescribed Models |
| Key Measured Metrics | Accuracy, Latency, Energy | Throughput, Latency, Accuracy | Inference Time, Energy | Inference Time, Accuracy |
| Standardized Workloads | Keyword Spotting, Visual Wake Words, Image Classification, Anomaly Detection | Image Classification, Object Detection, NLP, Recommendation | Image Classification, Keyword Spotting | Image Classification, NLP, Face Recognition |
| Submission Requirements | Full system reproducibility (code, build, run) | Detailed system description, reproducible results | Results submission to EEMBC portal | Mobile app execution & score upload |
| Industry Consortium Backing | MLPerf (MLCommons) | MLPerf (MLCommons) | EEMBC | Independent (ETH Zurich) |
| Primary Audience | MCU Vendors, Embedded ML Researchers | Cloud/Edge HW Vendors, Datacenter Operators | MCU/Silicon Vendors, OEMs | Mobile SoC Vendors, App Developers |
Frequently Asked Questions
These FAQs address the purpose, structure, and role of MLPerf Tiny in the TinyML ecosystem.
MLPerf Tiny is a benchmark suite from the MLPerf consortium designed to measure the performance and accuracy of machine learning inference systems on ultra-low-power devices like microcontrollers. It provides standardized, reproducible metrics for comparing TinyML solutions across different hardware, software, and model optimizations. The benchmark focuses on four key tasks representative of real-world edge applications: Keyword Spotting (KWS), Visual Wake Words (VWW), Image Classification (IC), and Anomaly Detection (AD). Each task is defined by a reference model, dataset, and quality target, ensuring fair comparisons. Submissions are measured on metrics including latency, energy consumption, and model accuracy, providing a holistic view of system efficiency for developers and hardware vendors.
Related Terms
MLPerf Tiny exists within a specialized ecosystem of software, hardware, and methodologies designed for ultra-low-power machine learning. These related concepts define the tools and techniques used to achieve benchmarked performance.
MCUNet (TinyNAS & TinyEngine)
A system co-design framework that jointly optimizes the neural network architecture (TinyNAS) and the inference runtime (TinyEngine) to enable ImageNet-scale classification on microcontrollers with under 512KB of memory. It represents the state-of-the-art in hardware-aware model and runtime co-optimization.
- TinyNAS searches for models that fit within a device's SRAM/Flash budget.
- TinyEngine generates specialized, ultra-lean C code via operator fusion and in-place depthwise convolution to minimize memory overhead.
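The budget constraint that drives this kind of search can be illustrated with a simple feasibility filter. This is not the TinyNAS algorithm, only a minimal sketch of the "fit the device first, then maximize accuracy" idea; the `Candidate` class, `best_fitting` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    peak_sram_kb: int
    flash_kb: int
    accuracy: float


def best_fitting(candidates, sram_kb, flash_kb):
    """Return the most accurate candidate that fits both budgets, or None."""
    feasible = [
        c for c in candidates
        if c.peak_sram_kb <= sram_kb and c.flash_kb <= flash_kb
    ]
    return max(feasible, key=lambda c: c.accuracy, default=None)


if __name__ == "__main__":
    # Illustrative values only, not measured results.
    pool = [
        Candidate("small", peak_sram_kb=180, flash_kb=600, accuracy=0.58),
        Candidate("medium", peak_sram_kb=300, flash_kb=950, accuracy=0.62),
        Candidate("large", peak_sram_kb=640, flash_kb=1900, accuracy=0.70),
    ]
    print(best_fitting(pool, sram_kb=320, flash_kb=1024))
```

A real search explores a far larger space and estimates memory from the layer graph rather than from a precomputed table.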
AI Coprocessor / microNPU
A dedicated hardware accelerator, such as the Arm Ethos-U55 or U65, integrated into a microcontroller or system-on-chip to offload and dramatically accelerate neural network inference tasks. MLPerf Tiny benchmarks performance on systems using these accelerators.
- Requires a vendor-specific NPU SDK for model compilation and deployment.
- Executes quantized (int8) models with extreme power efficiency.
- Works alongside a main Cortex-M CPU, which handles control logic and pre/post-processing.
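For readers unfamiliar with int8 execution, the following sketch shows standard affine quantization in plain Python. It illustrates the arithmetic only; real toolchains choose `scale` and `zero_point` per tensor or per channel during calibration, and the function names here are hypothetical.

```python
def quantize_int8(values, scale, zero_point):
    """Affine-quantize floats to int8: q = clamp(round(x / scale) + zero_point).

    Python's round() rounds exact halves to the nearest even integer, which
    may differ from the rounding mode a given toolchain uses.
    """
    quantized = []
    for x in values:
        q = round(x / scale) + zero_point
        quantized.append(max(-128, min(127, q)))  # saturate to the int8 range
    return quantized


def dequantize_int8(q_values, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [scale * (q - zero_point) for q in q_values]


if __name__ == "__main__":
    q = quantize_int8([0.0, 1.0, -1.5, 100.0], scale=0.5, zero_point=0)
    print(q)
    print(dequantize_int8(q, scale=0.5, zero_point=0))
```

Values outside the representable range saturate rather than wrap, so the last input above comes back as 63.5 instead of 100.0.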
TinyML Deployment Workflow
The end-to-end process of converting a trained model into firmware running on a microcontroller, which is the ultimate goal measured by MLPerf Tiny. This involves a specialized TinyML toolchain.
- Model Conversion & Optimization: Using a micro-compiler (e.g., TVM, nncase) for graph optimization and quantization.
- Code Generation: Outputting a C array model or linked library.
- Integration & Profiling: Embedding the model into firmware, managing the tensor arena, and validating latency/accuracy on real hardware.
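The code generation step can be approximated with a short script that renders a serialized model as a C array, similar in spirit to `xxd -i`. This is a sketch: the `to_c_array` helper and the default symbol name `g_model` are assumptions, and a production build would usually also add alignment attributes suited to the target.

```python
def to_c_array(data, name="g_model", per_line=12):
    """Render raw model bytes as a C source snippet."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )


if __name__ == "__main__":
    # Placeholder bytes standing in for a serialized model file.
    print(to_c_array(bytes(range(16)), name="g_model"))
```

Declaring the array `const` lets the linker place the model in Flash, leaving SRAM for the tensor arena.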

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.