Inferensys

On-Device SDK

An On-Device SDK is a vendor-specific software development kit that provides libraries, APIs, and tools to develop applications with local, on-device machine learning inference, typically for microcontrollers.
TINYML FRAMEWORKS

What is an On-Device SDK?

An on-device SDK is a vendor-specific software development kit that provides libraries, APIs, and tools to develop applications that include local, on-device machine learning inference, typically for a family of microcontrollers or processors.

An On-Device SDK is a specialized software development kit provided by a silicon or platform vendor to enable machine learning inference directly on the target hardware, bypassing cloud dependency. It contains optimized neural network kernels, a minimal inference runtime, and hardware-specific compilers to transform standard models into efficient code. This SDK is the critical bridge between a trained model and its deterministic execution in a resource-constrained embedded system, handling low-level memory management and processor-specific optimizations.

These SDKs, such as STM32Cube.AI or Arm's software stack for the Ethos-U55 microNPU, are essential for using dedicated hardware accelerators. They perform graph optimizations such as operator fusion and translate models into deployable formats such as C arrays or FlatBuffers. Because the SDK makes the TinyML deployment workflow hardware-aware, it can raise throughput and reduce latency and power consumption on the specific microcontroller or system-on-chip architecture.
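As an illustration of operator fusion, the sketch below folds a batch-normalization step into the dense layer that precedes it, so that one layer does the work of two at inference time. It is a minimal pure-Python example, not any vendor's implementation; the function names and the list-based tensor layout are assumptions made for readability.

```python
import math

def dense(weights, bias, x):
    """Plain dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization, applied per output channel."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fuse dense + batchnorm into a single dense layer (hypothetical helper).

    For each output channel: scale = gamma / sqrt(var + eps),
    W' = W * scale and b' = (b - mean) * scale + beta.
    """
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b

# Example values chosen only for illustration.
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.5], [0.0, 1.0]
mean, var = [0.3, -0.1], [4.0, 0.25]
x = [1.0, 2.0]

unfused = batchnorm(dense(W, b, x), gamma, beta, mean, var)
fused_W, fused_b = fold_batchnorm(W, b, gamma, beta, mean, var)
fused = dense(fused_W, fused_b, x)
```

The fused layer produces the same outputs as the two-step version (up to floating-point rounding) while removing one pass over the activations, which is the kind of saving a converter aims for on a memory-constrained target.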

TINYML FRAMEWORKS

Core Components of an On-Device SDK

An On-Device SDK provides the essential software tools to integrate local machine learning inference into microcontroller applications. Its core components handle model conversion, hardware acceleration, and efficient runtime execution.

01

Model Converter & Optimizer

This is the primary tool that transforms a trained model from a standard framework or exchange format (such as TensorFlow, PyTorch, or ONNX) into a hardware-optimized representation for the target microcontroller. Key functions include:

  • Graph Optimization: Applying techniques like operator fusion and constant folding to reduce computational steps.
  • Quantization: Converting model weights and activations from 32-bit floating-point to 8-bit integers (INT8) or other lower-precision formats to drastically shrink model size and speed up inference.
  • Pruning: Removing insignificant neurons or weights to create a sparser, more efficient model.
  • Output generation in formats like C array or FlatBuffer for direct embedding into firmware.
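
The quantization step listed above can be sketched as a simple affine INT8 mapping. This is a minimal illustration under stated assumptions (per-tensor quantization, a range forced to include zero, hypothetical function names); real converters typically add calibration data and per-channel scales.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Derive a per-tensor scale and zero point covering the value range."""
    lo = min(min(values), 0.0)   # range must include 0 so that 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # avoid a zero scale for all-zero tensors
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to INT8 with rounding and saturation."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]

def dequantize(q_values, scale, zero_point):
    """Recover approximate float values from INT8 codes."""
    return [(q - zero_point) * scale for q in q_values]

# Illustrative weights only.
weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
scale, zero_point = quant_params(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)

float_bytes = 4 * len(weights)   # 32-bit floats
int8_bytes = 1 * len(weights)    # 8-bit integers
```

Each weight shrinks from four bytes to one, and the round-trip error per value stays within one quantization step (`scale`).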
02

Hardware-Accelerated Kernels

These are highly optimized low-level functions that execute the fundamental mathematical operations of a neural network (like convolution, pooling, and fully-connected layers). Their performance is critical. They are typically:

  • Hand-written in assembly or optimized C for specific CPU instruction sets (e.g., Arm Cortex-M with DSP extensions).
  • Designed to leverage single instruction, multiple data (SIMD) instructions for parallel computation.
  • Provided for dedicated AI coprocessors or microNPUs (like the Arm Ethos-U55), where they act as driver libraries that offload entire layers from the main CPU.
  • Examples include the functions in CMSIS-NN for Arm Cortex-M or vendor-specific NPU kernel libraries.
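
To make clear what such a kernel computes, here is an unoptimized reference for an INT8 fully-connected layer with a 32-bit accumulator. It is a sketch only: the function name and parameters are hypothetical, weights are assumed symmetric (zero point 0), and the floating-point rescale stands in for the fixed-point multiplier and shift that production kernels normally use.

```python
def fully_connected_int8(x_q, w_q, bias_i32, x_zp, out_scale_ratio, out_zp):
    """Reference INT8 fully-connected layer (not optimized).

    x_q: INT8 input values, w_q: rows of INT8 weights,
    bias_i32: 32-bit integer biases,
    out_scale_ratio: (input_scale * weight_scale) / output_scale.
    """
    out = []
    for row, bias in zip(w_q, bias_i32):
        acc = bias                                  # 32-bit accumulator on real hardware
        for xi, wi in zip(x_q, row):
            acc += (xi - x_zp) * wi                 # multiply-accumulate, the SIMD hot loop
        y = int(round(acc * out_scale_ratio)) + out_zp   # requantize to the output scale
        out.append(max(-128, min(127, y)))          # saturate to the INT8 range
    return out

# Illustrative values only.
result = fully_connected_int8(
    x_q=[10, -20, 30],
    w_q=[[1, 2, 3], [-1, 0, 1]],
    bias_i32=[5, -5],
    x_zp=0,
    out_scale_ratio=0.25,
    out_zp=0,
)
```

An optimized kernel performs the same arithmetic but processes several multiply-accumulates per instruction, which is where the speedup over this scalar loop comes from.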
03

Inference Engine / Micro Interpreter

This is the minimal runtime that executes the optimized model on the device. It is responsible for the inference lifecycle:

  • Model Parsing: Reading the optimized model file (e.g., FlatBuffer) from ROM.
  • Memory Planning: Allocating a tensor arena in SRAM for intermediate activations and managing this limited memory pool efficiently across layers.
  • Graph Scheduling: Sequencing the execution of the model's operators, invoking the appropriate hardware-accelerated kernels.
  • Resource Management: Handling the device's constraints, such as avoiding dynamic memory allocation and ensuring deterministic execution timing.
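
The memory-planning step can be illustrated with a small greedy planner that lets tensors with non-overlapping lifetimes share the same region of the arena. This is a simplified sketch, not the algorithm of any particular runtime; the tuple layout and function name are assumptions.

```python
def plan_arena(tensors):
    """Assign arena offsets so tensors that are live at the same time never overlap.

    tensors: list of (name, size_bytes, first_op, last_op), where the op
    indices mark the first and last operator that touches the tensor.
    Returns ({name: offset}, arena_size_bytes).
    """
    placed = []    # (offset, size, first_op, last_op) of tensors already placed
    offsets = {}
    # Place the largest tensors first; small ones then fill the gaps.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only tensors whose lifetimes overlap can conflict in memory.
        conflicts = sorted((o, s) for o, s, f, l in placed
                           if not (l < first or last < f))
        offset = 0
        for o, s in conflicts:
            if offset + size <= o:
                break                      # fits in the gap before this tensor
            offset = max(offset, o + s)    # otherwise skip past it
        placed.append((offset, size, first, last))
        offsets[name] = offset
    arena_size = max(o + s for o, s, _, _ in placed)
    return offsets, arena_size

# A three-operator chain: input -> act1 -> act2 -> output (sizes are illustrative).
tensors = [
    ("input", 1000, 0, 0),
    ("act1", 4000, 0, 1),
    ("act2", 2000, 1, 2),
    ("output", 100, 2, 2),
]
offsets, arena_size = plan_arena(tensors)
naive_size = sum(size for _, size, _, _ in tensors)
```

In this example the planner reuses the space of buffers that are no longer live, so the arena is smaller than the sum of all tensor sizes, and every offset is fixed before the firmware runs.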
04

Deployment & Profiling Tools

A suite of utilities that bridge development and real-device validation. These tools ensure the model works correctly within system constraints:

  • Cross-Compilation Toolchain: Integrates with standard embedded toolchains (like GCC Arm) to compile the generated model code and runtime into the final firmware binary.
  • Memory Profiler: Analyzes SRAM usage (especially the tensor arena) and Flash consumption by the model weights and code.
  • Performance Profiler: Measures per-layer and total inference latency (in milliseconds) and CPU cycle counts, identifying bottlenecks.
  • Accuracy Validator: Compares the quantized model's output on the device against the original floating-point model's output to verify minimal accuracy loss.
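
An accuracy validator of this kind can be approximated on the host with a simple output comparison. The sketch below assumes a classification model whose on-device outputs have already been dequantized and transferred to the host; the function name, the tolerance, and the sample values are hypothetical.

```python
def compare_outputs(reference, device, tolerance):
    """Compare float-model outputs with dequantized on-device outputs."""
    if len(reference) != len(device):
        raise ValueError("output vectors must have the same length")
    max_abs_error = max(abs(r - d) for r, d in zip(reference, device))
    # For classifiers, the predicted class should not change after quantization.
    top1_match = reference.index(max(reference)) == device.index(max(device))
    return {
        "max_abs_error": max_abs_error,
        "top1_match": top1_match,
        "passed": max_abs_error <= tolerance and top1_match,
    }

# Illustrative class probabilities for one input sample.
float_output = [0.10, 0.70, 0.20]
device_output = [0.11, 0.68, 0.21]
report = compare_outputs(float_output, device_output, tolerance=0.05)
```

Running the same check over a representative validation set, rather than a single sample, gives a more trustworthy picture of the accuracy lost to quantization.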
05

Hardware Abstraction Layer (HAL)

A thin software layer that provides a uniform interface for the inference engine to access specific hardware features, ensuring portability across a vendor's microcontroller family. It abstracts:

  • Memory-mapped registers for AI accelerators or DSP units.
  • Direct Memory Access (DMA) controllers for efficient data movement.
  • Power management interfaces for putting the accelerator into low-power states when idle.
  • System timers used for latency measurement.

This allows the same SDK to support multiple chips (e.g., an entire STM32 or ESP32 series) with a single API.
06

Reference Applications & Model Zoo

Practical, ready-to-run examples that demonstrate SDK capabilities and serve as a starting point for development. This component includes:

  • End-to-end projects for common TinyML use cases: keyword spotting, visual wake words, anomaly detection in sensor data.
  • Pre-optimized models (a model zoo) that are already quantized, pruned, and validated for the target hardware, showcasing achievable accuracy and performance benchmarks.
  • Sample code illustrating the full deployment workflow: from capturing sensor data, running inference, to taking an action (like toggling a GPIO).

These applications are critical for reducing time-to-prototype and establishing performance baselines.
ARCHITECTURAL COMPARISON

On-Device SDK vs. General-Purpose TinyML Frameworks

A feature-by-feature comparison of vendor-specific On-Device SDKs and cross-platform, general-purpose TinyML frameworks, highlighting key trade-offs for deployment on microcontrollers.

| Feature / Metric | On-Device SDK (e.g., STM32Cube.AI, ESP-DL) | General-Purpose Framework (e.g., TensorFlow Lite Micro, TVM Micro) |
| --- | --- | --- |
| Primary Design Goal | Maximize performance and power efficiency for a specific vendor's silicon (MCU, NPU). | Provide portable inference across diverse microcontroller architectures. |
| Hardware Optimization | | |
| Supported Hardware Targets | Single vendor's MCU/NPU family (e.g., STM32, ESP32). | Broad range of Arm Cortex-M, RISC-V, Xtensa cores. |
| Integration with Vendor Tools | Tightly integrated into vendor IDE, HAL, and debugging ecosystem. | Requires manual integration with board support package and toolchain. |
| Model Format Support | Typically limited (e.g., TensorFlow Lite, ONNX). | Broad (TFLite, PyTorch via ONNX, Keras, etc.). |
| Graph & Operator Optimization | Extensive, hardware-aware fusions and kernel replacements. | General graph optimizations (constant folding, operator fusion). |
| Memory Footprint (Runtime) | < 15 KB typical | 20-50 KB typical |
| Inference Latency | Often 1.5-3x faster due to hand-tuned kernels. | Baseline performance; varies with target. |
| Ease of Porting to New Hardware | | |
| Access to Low-Level Hardware Features (e.g., DMA, NPU) | Direct, vendor-exposed APIs. | Abstracted; may require custom operator implementation. |
| Community & Ecosystem Support | Vendor-driven documentation and forums. | Large open-source community, academic contributions. |
| Long-Term Maintenance Risk | Tied to vendor's product roadmap and support. | Driven by open-source project health and community. |


VENDOR TOOLCHAINS

Examples of On-Device SDKs

These are specialized software development kits provided by silicon vendors and framework developers to compile, optimize, and deploy machine learning models directly onto their target microcontroller families or hardware accelerators.

TINYML DEPLOYMENT

Typical Deployment Workflow with an On-Device SDK

This workflow defines the systematic, multi-stage process for converting a trained machine learning model into a functional application on a microcontroller using a vendor-specific software development kit.

The workflow begins with model conversion and optimization, where a trained neural network from a framework like TensorFlow or PyTorch is imported into the SDK. The SDK's tools apply hardware-aware graph optimizations, quantization, and pruning to reduce the model's computational and memory footprint for the target microcontroller. The output is a hardware-optimized model file, often in a format like a FlatBuffer or a C array, ready for integration.
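
The C-array output mentioned above can be sketched with a short host-side script that turns a model file's bytes into a C source fragment, similar in spirit to `xxd -i`. This is an illustrative helper, not a tool from any SDK; the function name, the symbol name `g_model`, and the sample bytes are assumptions.

```python
def to_c_array(data: bytes, name: str = "g_model", per_line: int = 12) -> str:
    """Render raw model bytes as a C array that can be compiled into firmware."""
    if not data:
        raise ValueError("model data is empty")
    lines = [f"const unsigned char {name}[] = {{"]
    for i in range(0, len(data), per_line):
        chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + per_line])
        lines.append(f"  {chunk},")
    lines.append("};")
    lines.append(f"const unsigned int {name}_len = {len(data)};")
    return "\n".join(lines)

# Placeholder bytes standing in for a real optimized model file.
model_bytes = bytes([0x1C, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4C, 0x33])
c_source = to_c_array(model_bytes)
```

Because the array is declared `const`, a typical embedded toolchain places it in flash rather than SRAM, which is why this format suits devices without a filesystem.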

The final stage is firmware integration and validation. The optimized model and the SDK's inference runtime libraries are linked into the device's application firmware. Developers use the SDK's profiling tools to measure real-world latency, peak memory usage, and power consumption on the target hardware. Successful validation concludes the workflow, resulting in a production-ready binary for deployment to a device fleet.
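
The validation step can be expressed as a simple pass/fail gate over the profiler's measurements. The sketch below converts a cycle count into milliseconds and checks each metric against a budget; every number, key name, and function name here is hypothetical and would come from the project's own requirements and the SDK's profiling output.

```python
def latency_ms(cycles: int, clock_hz: int) -> float:
    """Convert a measured CPU cycle count into milliseconds."""
    return cycles * 1000.0 / clock_hz

def check_budget(measured: dict, budget: dict) -> dict:
    """Return, per metric, whether the measured value is within its budget."""
    return {key: measured[key] <= limit for key, limit in budget.items()}

# Hypothetical measurements for one inference on a 160 MHz core.
measured = {
    "latency_ms": latency_ms(cycles=4_800_000, clock_hz=160_000_000),
    "sram_bytes": 96 * 1024,     # peak tensor arena plus runtime state
    "flash_bytes": 310 * 1024,   # model weights plus inference code
}
budget = {
    "latency_ms": 50.0,
    "sram_bytes": 128 * 1024,
    "flash_bytes": 512 * 1024,
}
results = check_budget(measured, budget)
ready = all(results.values())
```

Running a gate like this in continuous integration helps catch a model update that silently exceeds the device's memory or timing limits.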

ON-DEVICE SDK

Frequently Asked Questions

An On-Device SDK is a vendor-specific software development kit that provides libraries, APIs, and tools to develop applications with local, on-device machine learning inference, typically for a family of microcontrollers or processors. Below are key questions about its role, components, and integration within the TinyML ecosystem.

An On-Device SDK is a vendor-specific software development kit that provides the libraries, APIs, and tools necessary to compile, optimize, and execute machine learning models directly on a target microcontroller or processor. It works by taking a trained neural network model (e.g., a TensorFlow Lite FlatBuffer) and converting it into highly optimized C code or machine code that can be linked into the device's firmware. The SDK typically includes a minimal inference runtime (a micro interpreter), a set of hardware-optimized neural network kernels (like those in CMSIS-NN), and compiler tools that perform critical graph optimizations such as operator fusion to reduce memory overhead and latency. The final output is a statically linked binary in which the model is often stored as a C array in program memory, enabling inference without a filesystem.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.