Glossary

Ethos-U55

The Arm Ethos-U55 is a configurable, area- and power-efficient microNPU (Neural Processing Unit) accelerator for machine learning inference in embedded and IoT endpoint devices.

MICRO NPU

What is Ethos-U55?

The Arm Ethos-U55 is a microNPU (Neural Processing Unit) accelerator designed to offload and dramatically accelerate machine learning inference workloads from the main Cortex-M CPU in deeply embedded and IoT devices. It is a configurable, licensable intellectual property (IP) block that silicon vendors integrate into their system-on-chip (SoC) designs to deliver high performance per watt for convolutional neural networks within strict area and power budgets.

Deploying models to the Ethos-U55 requires two pieces of software. The first is an NPU-aware compiler, Arm's Vela, which translates quantized TensorFlow Lite models into optimized command streams for the accelerator. The second is a runtime driver on the host, often delivered as part of a silicon vendor's SDK. This hardware-software co-design is critical for TinyML deployment. It enables complex vision and audio models to run in real time on microcontrollers where software-only execution would not fit within the available memory and latency budgets.

ARM ETHOS-U55

Key Architectural Features

MicroNPU Core Architecture

The Ethos-U55 is a dedicated neural processing unit architected from the ground up for constrained devices. Its core is a single-instruction, multiple-data (SIMD) engine with a tensor array of processing elements (PEs). Key features include:

Configurable MAC Count: The number of multiply-accumulate (MAC) units per cycle is scalable (32, 64, 128, or 256) to balance performance, silicon area, and power for the target application.
Weight Streaming: The architecture employs a weight decoder that streams compressed weights directly from external memory (e.g., Flash) into the PE array, minimizing the need for large on-chip SRAM to hold the entire model.
Activation Caching: Intermediate activation tensors are managed via a small, efficient system cache, reducing costly main memory accesses.
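Peak throughput can be estimated from the configuration alone, which is a useful sanity check on the GOPS figures quoted in this glossary. The sketch below rests on three assumptions:
- each unit in the configurable array performs one multiply-accumulate per cycle;
- each MAC counts as two operations;
- the array is fully utilized.

The 500 MHz clock is a hypothetical value, and real sustained throughput will be lower.

```python
# Peak throughput of a MAC array. Each multiply-accumulate is counted as
# 2 ops (one multiply + one add), the usual convention behind GOPS/TOPS figures.

def peak_gops(macs_per_cycle: int, clock_mhz: float) -> float:
    """Theoretical peak in GOPS, assuming no stalls and 100% utilization."""
    ops_per_cycle = 2 * macs_per_cycle
    return ops_per_cycle * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    # Hypothetical 500 MHz NPU clock; real SoCs clock the NPU differently.
    for macs in (32, 64, 128, 256):
        print(f"{macs:>3} MACs/cycle @ 500 MHz -> {peak_gops(macs, 500):.0f} GOPS")
```

At 500 MHz this gives 32, 64, 128, and 256 GOPS for the four configurations. Doubling the clock doubles each figure.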

Memory System & Dataflow

Efficient data movement is critical for low-power operation. The Ethos-U55 uses a hierarchical memory system and a weight-stationary dataflow.

Hierarchical Memory: Comprises small, fast buffers close to the PEs, a larger shared SRAM, and interfaces to external Flash and RAM. This minimizes energy-intensive accesses to main memory.
Weight-Stationary Flow: Weights are loaded into the PE array once and held stationary, while activation data streams through. This maximizes data reuse and reduces power consumption compared to activation-stationary or output-stationary flows.
Direct Memory Access (DMA): Dedicated DMA engines manage data transfers between memory hierarchies, operating in parallel with computation to hide latency.
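A simple double-buffering latency model shows why it helps to run DMA in parallel with computation. This is a minimal sketch, not a model of the U55's actual scheduler:
- each layer or tile is reduced to a hypothetical `(dma_cycles, compute_cycles)` pair;
- the next tile's transfer is assumed to overlap fully with the current tile's compute.

```python
from typing import List, Tuple

Stage = Tuple[int, int]  # hypothetical (dma_cycles, compute_cycles) for one tile or layer


def serial_cycles(stages: List[Stage]) -> int:
    """No overlap: every transfer must finish before its compute starts."""
    return sum(dma + compute for dma, compute in stages)


def overlapped_cycles(stages: List[Stage]) -> int:
    """Double buffering: the DMA fetches tile i+1 while tile i is computed.

    Only the first transfer is fully exposed; after that, each step costs the
    longer of the current compute and the next transfer.
    """
    if not stages:
        return 0
    total = stages[0][0]  # the first fetch cannot be hidden
    for i, (_, compute) in enumerate(stages):
        next_dma = stages[i + 1][0] if i + 1 < len(stages) else 0
        total += max(compute, next_dma)
    return total


if __name__ == "__main__":
    compute_bound = [(100, 300)] * 3  # hypothetical cycle counts
    memory_bound = [(400, 100)] * 3
    for name, s in (("compute-bound", compute_bound), ("memory-bound", memory_bound)):
        print(name, serial_cycles(s), "->", overlapped_cycles(s))
```

The compute-bound case drops from 1200 to 1000 cycles and the memory-bound case from 1500 to 1300 cycles. Overlap hides the shorter of the two activities but cannot hide the longer one. This is why reducing memory traffic matters as much as adding compute.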

Supported Operations & Precision

The accelerator is optimized for the most common layers and data types used in efficient neural networks for edge sensing.

Layer Support: Natively accelerates convolutions, fully connected layers, transpose convolutions, pooling (max/average), and element-wise operations (add, mul, etc.).
Data Types: Primarily optimized for int8 (8-bit integer) and int16 arithmetic, which provide a favorable accuracy-to-efficiency trade-off for microcontroller deployments. Support for asymmetric quantization is integral.
Activation Functions: Hardware-accelerated support for common non-linearities like ReLU, ReLU6, and sigmoid/tanh (via lookup tables).
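The sketch below shows what int8 asymmetric (affine) quantization and a lookup-table activation look like numerically. It is illustrative only:
- the function names and the input scale are hypothetical;
- Python's `round()` (ties to even) stands in for whatever rounding mode a real kernel specifies;
- the sigmoid output parameters (scale 1/256, zero point -128) are the convention TensorFlow Lite uses for int8 sigmoid outputs.

```python
import math
from typing import List

INT8_MIN, INT8_MAX = -128, 127


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Affine int8 quantization: q = round(x / scale) + zero_point, saturated."""
    q = round(x / scale) + zero_point
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 code back to the real value it represents."""
    return scale * (q - zero_point)


def build_sigmoid_lut(in_scale: float, in_zp: int,
                      out_scale: float = 1 / 256, out_zp: int = -128) -> List[int]:
    """Precompute sigmoid for all 256 int8 inputs (index = q + 128)."""
    lut = []
    for q in range(INT8_MIN, INT8_MAX + 1):
        x = dequantize(q, in_scale, in_zp)
        y = 1.0 / (1.0 + math.exp(-x))
        lut.append(quantize(y, out_scale, out_zp))
    return lut


def sigmoid_int8(q: int, lut: List[int]) -> int:
    """At inference time the non-linearity is a single table lookup."""
    return lut[q - INT8_MIN]


if __name__ == "__main__":
    lut = build_sigmoid_lut(in_scale=0.1, in_zp=0)  # hypothetical input scale
    for q in (-128, 0, 127):
        out = sigmoid_int8(q, lut)
        print(q, "->", out, f"(~{dequantize(out, 1 / 256, -128):.3f})")
```

Because an int8 input has only 256 possible values, any non-linearity can be precomputed offline. This is why a table is a cheap way to support sigmoid and tanh in hardware.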

Cortex-M System Integration

The Ethos-U55 is designed as an accelerator tightly coupled with an Arm Cortex-M CPU, typically connected through the AMBA AXI system bus.

Programmer's Model: The Cortex-M core acts as the host processor, controlling the U55 via a simple register interface. It offloads entire subgraphs or layers for execution.
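Shared Memory: The U55 operates on tensors located in system SRAM that the CPU can also access, enabling zero-copy data passing. The NPU is not cache coherent with the CPU. On cores with a data cache, software must therefore clean and invalidate the affected cache lines before and after each inference.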
Interrupt-Driven: The U55 signals task completion via an interrupt, allowing the CPU to sleep or perform other tasks during inference, maximizing system power efficiency.
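The register-write, start, interrupt sequence can be sketched as a host-side Python mock. Everything here is hypothetical:
- the register names (`CMD_ADDR`, `CMD_LEN`, `CTRL`, `STATUS`), the start bit, the address, and the `MockNPU` class are illustrative stand-ins, not the Ethos-U55 register map;
- on real hardware this is done in C through Arm's Ethos-U core driver;
- the CPU would wait in a low-power state (for example with `WFI`) rather than blocking on a `threading.Event`.

```python
import threading
import time


class MockNPU:
    """Hypothetical NPU model: register names and bits are illustrative only."""

    def __init__(self) -> None:
        self.regs = {"CMD_ADDR": 0, "CMD_LEN": 0, "CTRL": 0, "STATUS": 0}
        self.irq = threading.Event()  # stands in for the completion interrupt line

    def write_reg(self, name: str, value: int) -> None:
        self.regs[name] = value
        if name == "CTRL" and value & 0x1:  # hypothetical "start" bit
            threading.Thread(target=self._execute, daemon=True).start()

    def _execute(self) -> None:
        time.sleep(0.01)          # pretend to run the compiled command stream
        self.regs["STATUS"] = 1   # hypothetical "job done" flag
        self.irq.set()            # raise the completion interrupt


def run_inference(npu: MockNPU, cmd_addr: int, cmd_len: int, timeout_s: float = 1.0) -> int:
    """Host side: program the job, start it, then wait for the interrupt."""
    npu.irq.clear()
    npu.regs["STATUS"] = 0
    npu.write_reg("CMD_ADDR", cmd_addr)  # command stream location in shared memory
    npu.write_reg("CMD_LEN", cmd_len)
    npu.write_reg("CTRL", 0x1)           # start the job
    # A real Cortex-M host would sleep here until the NPU interrupt fires.
    if not npu.irq.wait(timeout_s):
        raise TimeoutError("NPU did not signal completion")
    return npu.regs["STATUS"]


if __name__ == "__main__":
    npu = MockNPU()
    status = run_inference(npu, cmd_addr=0x2000_0000, cmd_len=4096)
    print("status:", status)
```

The key design point is that the host does only a few register writes per job. All per-layer work happens inside the NPU, so the CPU is free (or asleep) for the whole inference.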

Toolchain: Vela Compiler

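Deploying a model to the Ethos-U55 requires the Arm Vela compiler, a target-specific optimizer and compiler. Vela takes a quantized (int8 or int16) TensorFlow Lite model and produces an optimized TensorFlow Lite model. In the output, the NPU-supported parts of the graph are replaced by command streams for the accelerator.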

Graph Optimization: Performs hardware-aware optimizations such as operator fusion, tensor lifetime analysis and memory allocation, scheduling, and command stream generation.
Weight Encoding: Applies lossless compression to the model weights to reduce Flash footprint and memory bandwidth.
Performance Estimation: Provides detailed memory usage and cycle count estimates for the compiled network, enabling design space exploration before hardware tape-out.
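A typical compilation step invokes Vela from the command line with the target configuration. The helper below only builds the argument list, so it can be inspected or passed to `subprocess.run(cmd, check=True)`. Note the following:
- `vela_command` and `expected_output` are hypothetical helpers, and the model filename is made up;
- the `--accelerator-config` and `--output-dir` options and the `<model>_vela.tflite` output name follow Vela's documented CLI, but check them against the version you install.

```python
import shlex
from pathlib import Path
from typing import List

# The four Ethos-U55 MAC configurations, as used in Vela's accelerator names.
U55_MACS = {32, 64, 128, 256}


def vela_command(model: str, macs: int = 128, output_dir: str = "output") -> List[str]:
    """Build (but do not run) a Vela invocation for an Ethos-U55 target."""
    if macs not in U55_MACS:
        raise ValueError(f"unsupported Ethos-U55 MAC configuration: {macs}")
    return [
        "vela", model,
        "--accelerator-config", f"ethos-u55-{macs}",
        "--output-dir", output_dir,
    ]


def expected_output(model: str, output_dir: str = "output") -> Path:
    """Path where the optimized model is expected: <model-stem>_vela.tflite."""
    return Path(output_dir) / f"{Path(model).stem}_vela.tflite"


if __name__ == "__main__":
    cmd = vela_command("kws_int8.tflite", macs=256)  # hypothetical model file
    print(shlex.join(cmd))
    # To actually compile (requires Vela installed): subprocess.run(cmd, check=True)
    print(expected_output("kws_int8.tflite"))
```

Keeping the MAC configuration as a parameter makes it easy to compile the same model for several configurations. You can then compare Vela's performance estimates across them.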

Power & Performance Profiles

The architecture is designed for the extreme efficiency required in always-on, battery-powered endpoints.

Performance: Delivers inference speeds ranging from tens to hundreds of GOPS (Giga Operations Per Second), depending on configuration and clock frequency, enabling real-time audio and vision tasks.
Power Efficiency: Achieves TOPS/W (Tera Operations Per Second per Watt) figures significantly higher than CPU-only execution, often in the range of 2-5 TOPS/W.
Area Efficiency: As a microNPU, it occupies a small silicon footprint (e.g., ~0.5 mm² for a 128-PE configuration in a 22nm process), making it viable for cost-sensitive MCU/SoC integration.
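TOPS/W converts directly into energy per inference, because 1 TOPS/W equals 10^12 operations per joule. The sketch makes these assumptions:
- a hypothetical model of 10 million MACs, with each MAC counted as two operations;
- the efficiency values are taken from the ranges in this glossary;
- memory and system overheads are ignored.

The results are therefore lower bounds, not measurements.

```python
def energy_per_inference_uj(macs: float, tops_per_watt: float) -> float:
    """Lower-bound energy (microjoules) for one inference at a given efficiency."""
    ops = 2.0 * macs                        # one MAC = one multiply + one add
    ops_per_joule = tops_per_watt * 1e12    # 1 TOPS/W == 1e12 operations per joule
    return ops / ops_per_joule * 1e6        # joules -> microjoules


if __name__ == "__main__":
    model_macs = 10e6  # hypothetical model size: 10 million MACs per inference
    for label, eff in (("NPU @ 2 TOPS/W", 2.0), ("CPU @ 0.2 TOPS/W", 0.2)):
        print(f"{label}: {energy_per_inference_uj(model_macs, eff):.1f} uJ")
```

Under these assumptions the NPU spends about 10 µJ per inference versus about 100 µJ for the CPU. For an always-on device running many inferences per second, this difference dominates the battery budget.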

How It Works: System Integration & Dataflow

The Ethos-U55 microNPU operates as a tightly-coupled coprocessor to an Arm Cortex-M host CPU, offloading neural network inference workloads. It connects to the system bus through AXI interfaces, enabling direct access to model weights and input data stored in system memory. The host CPU initiates inference by writing control registers, triggering the U55 to autonomously fetch and execute the compiled neural network graph. This hardware acceleration dramatically reduces CPU load and system power consumption compared to software-only execution on the Cortex-M core.

Data flows through a dedicated SRAM tensor buffer within the U55, minimizing costly external memory accesses. The accelerator's systolic array architecture performs the dense matrix multiplications central to convolutional and fully connected layers with extreme efficiency. Post-processing activation functions like ReLU are applied on-the-fly. The host is notified via interrupt upon completion, at which point the final output tensors are available in main memory. This streamlined pipelined dataflow is managed entirely by the U55's internal sequencer, requiring minimal CPU oversight after job submission.
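Each output of a fully connected layer passes through the same arithmetic steps: int8 multiplies, a wide int32 accumulator, requantization back to int8, and a ReLU applied in the same pass. The sketch below shows those steps and makes several simplifying assumptions:
- it does not model the systolic array's spatial layout or the U55's exact rounding;
- the function name and the single floating-point requantization multiplier are hypothetical (real kernels use fixed-point multipliers and shifts);
- weights use symmetric quantization with a zero point of 0.

```python
from typing import List


def fc_int8_relu(x: List[int], w: List[List[int]], bias: List[int],
                 x_zp: int, multiplier: float, out_zp: int) -> List[int]:
    """Fully connected layer with int8 inputs/weights, int32 accumulation,
    requantization to int8, and a fused ReLU on the output."""
    out = []
    for row, b in zip(w, bias):          # one output channel per weight row
        acc = b                          # accumulator seeded with the int32 bias
        for xi, wi in zip(x, row):
            acc += (xi - x_zp) * wi      # weight zero point assumed to be 0
        # Requantize: multiplier stands for in_scale * w_scale / out_scale.
        q = round(acc * multiplier) + out_zp
        q = max(-128, min(127, q))       # saturate to int8
        q = max(q, out_zp)               # ReLU: real 0.0 maps to out_zp
        out.append(q)
    return out


if __name__ == "__main__":
    x = [10, -5, 3]
    w = [[1, 2, 3], [-4, 0, 1]]
    print(fc_int8_relu(x, w, bias=[0, 0], x_zp=0, multiplier=1.0, out_zp=0))
```

With these toy values the first channel accumulates to 9 and passes through. The second accumulates to -37 and is clamped to 0 by the ReLU. Fusing the clamp into the output stage means no separate pass over memory is needed for the activation.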

Primary Use Cases & Applications

The Arm Ethos-U55 microNPU is designed to bring efficient machine learning inference to deeply embedded systems. Its primary applications are in power- and area-constrained endpoint devices where local intelligence is critical.

Keyword Spotting & Voice Control

Enables always-on, low-power voice interfaces in consumer and industrial devices. The Ethos-U55 offloads the computationally intensive neural network inference from the main Cortex-M CPU, allowing for complex models like Keyword Spotting (KWS) and Automatic Speech Recognition (ASR) to run locally with minimal battery drain.

Example: Wake-word detection in smart home assistants, voice commands for industrial tools.
Benefit: Eliminates the latency and privacy concerns of cloud-based processing.

Anomaly Detection in Industrial IoT

Processes sensor data streams (vibration, acoustic, current) in real-time to predict mechanical failures. The microNPU executes time-series classification or regression models directly on the sensor node, enabling predictive maintenance.

Example: Monitoring pump vibrations for bearing wear, detecting electrical anomalies in motors.
Benefit: Reduces unplanned downtime by enabling condition-based maintenance without constant cloud connectivity.

Visual Wake Words & People Detection

Brings computer vision to battery-powered cameras and sensors. The Ethos-U55 accelerates lightweight convolutional neural networks (CNNs) for tasks like person detection, object counting, or simple classification.

Example: A security camera that activates recording only when a person is detected, conserving energy and bandwidth.
Benefit: Enables privacy-preserving, reactive systems that filter data at the source.

Predictive Sensor Fusion

Combines data from multiple sensors (IMU, environmental) using small neural networks to infer complex states. The Ethos-U55's efficiency allows for running multi-input models that would be too heavy for a Cortex-M core alone.

Example: In a wearable, fusing accelerometer and gyroscope data for advanced activity recognition or fall detection.
Benefit: Creates more context-aware and intelligent sensing applications from simple sensor hardware.

Low-Power Biometric Authentication

Enables on-device biometric verification for enhanced security. The accelerator can run models for fingerprint matching or face recognition locally, ensuring biometric data never leaves the device.

Example: Access control systems, secure payment terminals, or personal lockboxes.
Benefit: Provides a high-security, low-latency authentication path that is resilient to network outages.

Smart Health & Wearable Monitoring

Processes physiological signals for real-time health insights. The Ethos-U55 can run models for heart rate variability analysis, sleep stage classification, or abnormal ECG detection directly on a wearable device.

Example: A smartwatch that provides real-time atrial fibrillation detection without sending raw ECG data to a phone.
Benefit: Maximizes user privacy and enables immediate, life-critical alerts.

PERFORMANCE COMPARISON

Ethos-U55 vs. Software-Only Inference on Cortex-M

This table quantifies the performance, efficiency, and system impact of using the Arm Ethos-U55 microNPU accelerator versus executing neural network inference solely on the CPU cores of a Cortex-M microcontroller.

Metric / Feature	Software-Only on Cortex-M	Ethos-U55 Accelerated
Peak INT8 Throughput	~2-5 GOPS (varies by core)	Configuration-dependent: 2 × MACs/cycle × clock (e.g., 256 MACs at 1 GHz = 512 GOPS)
Typical Inference Speedup	1x (Baseline)	10x to 50x
Power Efficiency (TOPS/W)	~0.1 - 0.5 TOPS/W	~2-5 TOPS/W (configuration- and process-dependent)
CPU Utilization During Inference	80-100%	< 10% (manages NPU)
System Power Draw (Active Inference)	High (CPU fully active)	Low (CPU idle, NPU efficient)
Latency for a 50-layer CNN	500-2000 ms	10-100 ms
Memory Bandwidth Pressure	High (weights & activations via bus)	Reduced (compressed weights, on-chip activation buffering)
Supported Data Types	INT8, INT16 (CMSIS-NN); FP32 (reference kernels)	INT8 (primary), INT16
Model Porting Effort	Lower (CMSIS-NN kernels)	Requires NPU-specific compilation & tuning
Deterministic Execution Timing	Yes (CPU-based)	Yes (NPU is a deterministic hardware block)
Hardware Cost & Silicon Area	N/A (uses existing CPU)	Additional die area for NPU macro
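The speedup row depends on how much of a network actually runs on the NPU. Vela maps supported operators to the accelerator and leaves unsupported ones to run on the Cortex-M. Amdahl's law gives a first-order estimate of the end-to-end effect. The offload fractions and the 50x per-operator speedup below are hypothetical inputs, not measured values.

```python
def overall_speedup(npu_fraction: float, npu_speedup: float) -> float:
    """Amdahl's law for partial offload.

    npu_fraction: share of the CPU-only runtime spent in NPU-supported operators.
    npu_speedup:  how much faster those operators run on the NPU.
    """
    if not 0.0 <= npu_fraction <= 1.0:
        raise ValueError("npu_fraction must be between 0 and 1")
    if npu_speedup <= 0:
        raise ValueError("npu_speedup must be positive")
    cpu_part = 1.0 - npu_fraction          # still runs on the Cortex-M at 1x
    npu_part = npu_fraction / npu_speedup  # offloaded work, accelerated
    return 1.0 / (cpu_part + npu_part)


if __name__ == "__main__":
    for frac in (1.0, 0.99, 0.95, 0.90):
        print(f"{frac:.0%} offloaded at 50x -> {overall_speedup(frac, 50.0):.1f}x overall")
```

Leaving even 5% of the original runtime on the CPU caps the overall result near 14.5x, despite the 50x per-operator gain. Checking which operators Vela actually maps to the NPU therefore matters as much as the NPU configuration itself.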

Frequently Asked Questions

Essential questions and answers about the Arm Ethos-U55 microNPU, a dedicated hardware accelerator for machine learning inference in deeply embedded systems.

What is the Arm Ethos-U55 and how does it work?

The Arm Ethos-U55 is a micro Neural Processing Unit (microNPU), a dedicated hardware accelerator designed to offload and execute machine learning inference workloads from an Arm Cortex-M series CPU. It works by receiving compiled neural network subgraphs, which it processes using highly optimized, parallel compute units for operations like convolutions and activations. The U55 features a configurable architecture. Silicon designers can scale its performance and area by adjusting the number of MACs per cycle and the size of its internal weight and activation memory. It accesses tensors in the same system memory as the host CPU, enabling low-latency data exchange without copies. It is managed through a driver that integrates with frameworks like TensorFlow Lite Micro.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
