Glossary

On-Device SDK


TINYML FRAMEWORKS

What is an On-Device SDK?

An on-device SDK is a vendor-specific software development kit that provides libraries, APIs, and tools to develop applications that include local, on-device machine learning inference, typically for a family of microcontrollers or processors.

An On-Device SDK is a specialized software development kit provided by a silicon or platform vendor to enable machine learning inference directly on the target hardware, bypassing cloud dependency. It contains optimized neural network kernels, a minimal inference runtime, and hardware-specific compilers to transform standard models into efficient code. This SDK is the critical bridge between a trained model and its deterministic execution in a resource-constrained embedded system, handling low-level memory management and processor-specific optimizations.

These SDKs, such as STM32Cube.AI or the Arm Ethos-U55 NPU SDK, are essential for leveraging dedicated hardware accelerators like microNPUs. They perform graph optimizations like operator fusion and translate models into deployable formats like C arrays or FlatBuffers. The SDK ensures the TinyML deployment workflow is hardware-aware, maximizing performance and minimizing latency and power consumption for the specific microcontroller or system-on-chip architecture.

TINYML FRAMEWORKS

Core Components of an On-Device SDK

An On-Device SDK provides the essential software tools to integrate local machine learning inference into microcontroller applications. Its core components handle model conversion, hardware acceleration, and efficient runtime execution.

Model Converter & Optimizer

This is the primary tool that transforms a trained model from a standard format (like TensorFlow, PyTorch, or ONNX) into a hardware-optimized representation for the target microcontroller. Key functions include:

Graph Optimization: Applying techniques like operator fusion and constant folding to reduce computational steps.
Quantization: Converting model weights and activations from 32-bit floating-point to 8-bit integers (INT8) or other lower-precision formats. Moving weights from 32 bits to 8 bits cuts their storage by about 4x and lets inference run on faster integer arithmetic.
Pruning: Removing insignificant neurons or weights to create a sparser, more efficient model.
Output Generation: Emitting the optimized model as a C array or FlatBuffer for direct embedding into firmware.
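To make the quantization step concrete, here is a minimal sketch of affine INT8 quantization in plain Python. The function names (`quantize_params`, `quantize`, `dequantize`) and the sample weights are illustrative assumptions, not the API of any particular SDK; real converters usually calibrate ranges per tensor or per channel from representative data.

```python
def quantize_params(values, qmin=-128, qmax=127):
    """Derive an affine scale and zero point covering the observed range."""
    lo = min(min(values), 0.0)  # include 0.0 so that zero is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, saturating at the INT8 limits."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical weights, used only to exercise the functions.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zp = quantize_params(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error stays within one quantization step (`scale`), which is the precision the model gives up in exchange for the smaller footprint.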

Hardware-Accelerated Kernels

These are highly optimized low-level functions that execute the fundamental mathematical operations of a neural network (like convolution, pooling, and fully-connected layers). Their performance is critical. They are typically:

Hand-written in assembly or optimized C for specific CPU instruction sets (e.g., Arm Cortex-M with DSP extensions).
Designed to leverage single instruction, multiple data (SIMD) instructions for parallel computation.
Provided for dedicated AI coprocessors or microNPUs (like the Arm Ethos-U55), where they act as driver libraries that offload entire layers from the main CPU.
Examples include the functions in CMSIS-NN for Arm Cortex-M or vendor-specific NPU kernel libraries.
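As a rough illustration of what such a kernel computes, the sketch below implements an INT8 fully-connected layer in plain Python: integer multiply-accumulate, rescaling, and saturation. It is not the CMSIS-NN API; the function name and all parameters are assumptions for illustration, and a production kernel would replace the floating-point `multiplier` with a fixed-point multiplier and shift, and would vectorize the inner loop with SIMD instructions.

```python
def fully_connected_int8(x_q, w_q, bias_q, x_zp, out_zp, multiplier):
    """INT8 fully-connected layer (illustrative sketch).

    x_q:        quantized inputs
    w_q:        rows of quantized weights (assumed symmetric, zero point 0)
    bias_q:     integer biases, already expressed at scale s_x * s_w
    multiplier: s_x * s_w / s_out, the rescaling factor to the output scale
    """
    out = []
    for row, bias in zip(w_q, bias_q):
        acc = bias  # a real kernel keeps this in a 32-bit accumulator
        for x, w in zip(x_q, row):
            acc += (x - x_zp) * w
        y = int(round(acc * multiplier)) + out_zp
        out.append(max(-128, min(127, y)))  # saturate to the INT8 range
    return out


# Hypothetical values chosen so the arithmetic is easy to follow by hand.
x_q = [10, -20, 30]
w_q = [[1, 2, 3], [-4, 5, -6]]
bias_q = [100, -50]
y_q = fully_connected_int8(x_q, w_q, bias_q, x_zp=-5, out_zp=3, multiplier=0.0625)
y_sat = fully_connected_int8(x_q, w_q, bias_q, x_zp=-5, out_zp=3, multiplier=2.0)
```

With these inputs the accumulators are 190 and -395, so `y_q` is `[15, -22]`, and the larger multiplier in `y_sat` shows the saturation path.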

Inference Engine / Micro Interpreter

This is the minimal runtime that executes the optimized model on the device. It is responsible for the inference lifecycle:

Model Parsing: Reading the optimized model file (e.g., FlatBuffer) from ROM.
Memory Planning: Allocating a tensor arena in SRAM for intermediate activations and managing this limited memory pool efficiently across layers.
Graph Scheduling: Sequencing the execution of the model's operators, invoking the appropriate hardware-accelerated kernels.
Resource Management: Handling the device's constraints, such as avoiding dynamic memory allocation and ensuring deterministic execution timing.
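The lifecycle above can be sketched as a very small interpreter. This is a toy model in Python, not the TensorFlow Lite Micro API: the `MicroInterpreter` class, the `KERNELS` table, and the two-operator graph are all hypothetical. It shows kernel resolution at start-up, a fixed tensor table, and in-order dispatch with no allocation decisions left to inference time.

```python
def op_add(inputs):
    a, b = inputs
    return [x + y for x, y in zip(a, b)]


def op_relu(inputs):
    (x,) = inputs
    return [max(0.0, v) for v in x]


# Registry of available kernels; an SDK would map these to optimized routines.
KERNELS = {"ADD": op_add, "RELU": op_relu}


class MicroInterpreter:
    def __init__(self, ops, num_tensors, constants):
        # Resolve every kernel up front so an unsupported operator fails at
        # start-up rather than in the middle of an inference.
        self.plan = [(KERNELS[name], ins, out) for name, ins, out in ops]
        self.tensors = [None] * num_tensors  # fixed-size tensor table
        for index, value in constants.items():
            self.tensors[index] = value

    def invoke(self, input_index, data, output_index):
        self.tensors[input_index] = data
        for kernel, ins, out in self.plan:  # operators run in graph order
            self.tensors[out] = kernel([self.tensors[i] for i in ins])
        return self.tensors[output_index]


# Graph: tensor2 = tensor0 + tensor1 (constant bias); tensor3 = relu(tensor2)
MODEL_OPS = [("ADD", (0, 1), 2), ("RELU", (2,), 3)]
interp = MicroInterpreter(MODEL_OPS, num_tensors=4, constants={1: [1.0, 1.0, -5.0]})
result = interp.invoke(0, [-2.0, 0.5, 3.0], 3)
```

Here `result` is `[0.0, 1.5, 0.0]`. Building the plan in the constructor mirrors the common embedded practice of doing all setup once so that each inference follows the same fixed path.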

Deployment & Profiling Tools

A suite of utilities that bridge development and real-device validation. These tools ensure the model works correctly within system constraints:

Cross-Compilation Toolchain: Integrates with standard embedded toolchains (like GCC Arm) to compile the generated model code and runtime into the final firmware binary.
Memory Profiler: Analyzes SRAM usage (especially the tensor arena) and Flash consumption by the model weights and code.
Performance Profiler: Measures per-layer and total inference latency (in milliseconds) and CPU cycle counts, identifying bottlenecks.
Accuracy Validator: Compares the quantized model's output on the device against the original floating-point model's output to verify minimal accuracy loss.
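An accuracy check of this kind can be approximated with a few simple metrics. The sketch below is an assumption about what such a validator might compute, not the output of any vendor tool: it compares a floating-point reference output with a quantized model's output using maximum absolute error, cosine similarity, and top-1 agreement. The sample vectors are made up.

```python
import math


def compare_outputs(reference, candidate):
    """Compare a float reference output with a quantized model's output."""
    max_abs = max(abs(r - c) for r, c in zip(reference, candidate))
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(
        sum(c * c for c in candidate)
    )
    cosine = dot / norm if norm else 1.0
    top1_ref = max(range(len(reference)), key=reference.__getitem__)
    top1_cand = max(range(len(candidate)), key=candidate.__getitem__)
    return {
        "max_abs_error": max_abs,
        "cosine_similarity": cosine,
        "top1_match": top1_ref == top1_cand,
    }


# Hypothetical class probabilities from the float model and the on-device model.
float_out = [0.10, 0.70, 0.20]
device_out = [0.12, 0.66, 0.22]
report = compare_outputs(float_out, device_out)
```

In practice the comparison would run over a whole validation set, with a pass threshold chosen for the application.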

Hardware Abstraction Layer (HAL)

A thin software layer that provides a uniform interface for the inference engine to access specific hardware features, ensuring portability across a vendor's microcontroller family. It abstracts:

Memory-mapped registers for AI accelerators or DSP units.
Direct Memory Access (DMA) controllers for efficient data movement.
Power management interfaces for putting the accelerator into low-power states when idle.
System timers used for latency measurement.

This abstraction allows the same SDK to support multiple chips (e.g., an entire STM32 or ESP32 series) with a single API.

Reference Applications & Model Zoo

Practical, ready-to-run examples that demonstrate SDK capabilities and serve as a starting point for development. This component includes:

End-to-end projects for common TinyML use cases: keyword spotting, visual wake words, anomaly detection in sensor data.
Pre-optimized models (a model zoo) that are already quantized, pruned, and validated for the target hardware, showcasing achievable accuracy and performance benchmarks.
Sample code illustrating the full deployment workflow: from capturing sensor data, running inference, to taking an action (like toggling a GPIO).
These applications are critical for reducing time-to-prototype and establishing performance baselines.

ARCHITECTURAL COMPARISON

On-Device SDK vs. General-Purpose TinyML Frameworks

A feature-by-feature comparison of vendor-specific On-Device SDKs and cross-platform, general-purpose TinyML frameworks, highlighting key trade-offs for deployment on microcontrollers.

Feature / Metric	On-Device SDK (e.g., STM32Cube.AI, ESP-DL)	General-Purpose Framework (e.g., TensorFlow Lite Micro, TVM Micro)
Primary Design Goal	Maximize performance & power efficiency for a specific vendor's silicon (MCU, NPU).	Provide portable inference across diverse microcontroller architectures.
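Hardware Optimization	Deep and vendor-specific (hand-tuned kernels, NPU offload).	Generic, portable kernels; optimized backends (e.g., CMSIS-NN) are optional.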
Supported Hardware Targets	Single vendor's MCU/NPU family (e.g., STM32, ESP32).	Broad range of Arm Cortex-M, RISC-V, Xtensa cores.
Integration with Vendor Tools	Tightly integrated into vendor IDE, HAL, and debugging ecosystem.	Requires manual integration with board support package and toolchain.
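Model Format Support	Typically limited (e.g., TensorFlow Lite, ONNX).	Depends on the framework: TensorFlow Lite Micro consumes TensorFlow Lite FlatBuffers only, while TVM imports TFLite, ONNX, PyTorch, Keras, and others.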
Graph & Operator Optimization	Extensive, hardware-aware fusions and kernel replacements.	General graph optimizations (constant folding, operator fusion).
Memory Footprint (Runtime)	< 15 KB typical	20-50 KB typical
Inference Latency	Often 1.5-3x faster due to hand-tuned kernels.	Baseline performance; varies with target.
Ease of Porting to New Hardware	Low; tied to the vendor's silicon.	High; designed for portability across architectures.
Access to Low-Level Hardware Features (e.g., DMA, NPU)	Direct, vendor-exposed APIs.	Abstracted; may require custom operator implementation.
Community & Ecosystem Support	Vendor-driven documentation and forums.	Large open-source community, academic contributions.
Long-Term Maintenance Risk	Tied to vendor's product roadmap and support.	Driven by open-source project health and community.

VENDOR TOOLCHAINS

Examples of On-Device SDKs

These are specialized software development kits provided by silicon vendors and framework developers to compile, optimize, and deploy machine learning models directly onto their target microcontroller families or hardware accelerators.

TensorFlow Lite Micro (TFLM)

A cross-platform, open-source inference framework from Google designed to run neural networks on microcontrollers with only kilobytes of memory. It uses a micro interpreter to execute models serialized in the FlatBuffer format and provides a portable C++ API. Although it is a general-purpose framework rather than a vendor SDK, it is the reference framework for many academic and commercial TinyML projects, and several vendor SDKs build on it.


STM32Cube.AI

STMicroelectronics' expansion pack (X-CUBE-AI) for their STM32CubeMX configuration tool. It converts pre-trained models in formats such as Keras, TensorFlow Lite, and ONNX (the usual export route for PyTorch models) into optimized C code for deployment across the STM32 microcontroller portfolio. It supports post-training quantization and can target devices with or without hardware accelerators, generating code that integrates directly into the STM32 HAL.


ESP-DL

Espressif Systems' deep learning library for their ESP32 and ESP32-S series chips. It provides highly optimized neural network operations that, on chips such as the ESP32-S3, leverage the core's vector (SIMD) instruction extensions. The SDK includes tools to convert trained models and a programming guide for implementing computer vision and audio applications on the edge.


Arm CMSIS-NN

A collection of efficient neural network kernels developed by Arm as part of the Cortex Microcontroller Software Interface Standard. It provides hand-optimized, fixed-point functions for Arm Cortex-M processor cores (Cortex-M0, M3, M4, M7, M33, M55). CMSIS-NN is often used as the backend kernel library for higher-level frameworks like TFLM to maximize performance on Arm-based silicon.


NVIDIA JetPack SDK

While targeting more powerful systems-on-module like the Jetson series, JetPack is a quintessential on-device SDK for edge AI. It includes the TensorRT SDK for high-performance inference optimization, deep learning libraries like cuDNN, and full OS support. It demonstrates the progression of on-device SDKs from microcontrollers to powerful embedded AI computers.


Qualcomm AI Engine Direct SDK

Provides low-level access to the heterogeneous compute cores (Hexagon DSP, Adreno GPU, Kryo CPU) within Qualcomm Snapdragon platforms. It allows developers to hand-optimize neural network execution by targeting specific accelerators. This SDK is key for deploying high-performance, power-efficient AI on smartphones, IoT hubs, and automotive platforms.


TINYML DEPLOYMENT

Typical Deployment Workflow with an On-Device SDK

This workflow defines the systematic, multi-stage process for converting a trained machine learning model into a functional application on a microcontroller using a vendor-specific software development kit.

The workflow begins with model conversion and optimization, where a trained neural network from a framework like TensorFlow or PyTorch is imported into the SDK. The SDK's tools apply hardware-aware graph optimizations, quantization, and pruning to reduce the model's computational and memory footprint for the target microcontroller. The output is a hardware-optimized model file, often in a format like a FlatBuffer or a C array, ready for integration.
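One common instance of a graph optimization at this stage is operator fusion. As an illustrative assumption (the text does not name a specific fusion), the sketch below folds a batch normalization layer into the preceding dense layer, so the device executes one operator instead of two. All function names and values are hypothetical.

```python
import math


def dense(x, weights, bias):
    """Plain dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Per-channel batch normalization at inference time."""
    return [
        g * (v - m) / math.sqrt(s + eps) + bt
        for v, g, bt, m, s in zip(y, gamma, beta, mean, var)
    ]


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch norm into the preceding dense layer's weights and bias."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, s in zip(weights, bias, gamma, beta, mean, var):
        k = g / math.sqrt(s + eps)  # per-channel scaling factor
        fused_w.append([w * k for w in row])
        fused_b.append((b - m) * k + bt)
    return fused_w, fused_b


# Hypothetical layer parameters and input.
W = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0]

unfused = batch_norm(dense(x, W, b), gamma, beta, mean, var)
fused_W, fused_b = fold_batch_norm(W, b, gamma, beta, mean, var)
fused = dense(x, fused_W, fused_b)
```

Both paths produce the same output up to floating-point rounding, while the fused version removes one operator and its intermediate tensor from the graph.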

The final stage is firmware integration and validation. The optimized model and the SDK's inference runtime libraries are linked into the device's application firmware. Developers use the SDK's profiling tools to measure real-world latency, peak memory usage, and power consumption on the target hardware. Successful validation concludes the workflow, resulting in a production-ready binary for deployment to a device fleet.

ON-DEVICE SDK

Frequently Asked Questions

Key questions about the role, components, and integration of an On-Device SDK within the TinyML ecosystem.

What is an On-Device SDK and how does it work?

An On-Device SDK is a vendor-specific software development kit that provides the libraries, APIs, and tools necessary to compile, optimize, and execute machine learning models directly on a target microcontroller or processor. It works by taking a trained neural network model (e.g., a TensorFlow Lite FlatBuffer) and converting it into highly optimized C code or machine code that can be linked into the device's firmware. The SDK typically includes a minimal inference runtime (a micro interpreter), a set of hardware-optimized neural network kernels (like those in CMSIS-NN), and compiler tools that perform graph optimizations such as operator fusion to reduce memory overhead and latency. The final output is a statically linked binary in which the model is often stored as a C array in program memory, enabling inference without a filesystem.


TINYML FRAMEWORKS

Related Terms

An On-Device SDK is a critical component of the TinyML deployment stack. These related terms define the specific tools, formats, and hardware that enable efficient machine learning on microcontrollers.

Embedded ML Framework

An Embedded ML Framework is a software library or toolchain specifically engineered to enable the deployment and execution of machine learning models on microcontroller-based embedded systems. Unlike general-purpose frameworks, they are built with severe constraints in mind:

Minimal memory footprint for code and runtime.
Optimized kernels for fixed-point or low-bit arithmetic.
Hardware-abstraction layers for portability across MCU architectures.

Examples include TensorFlow Lite Micro, CMSIS-NN, and proprietary vendor SDKs. They provide the foundational APIs for loading models, managing tensor memory, and executing inference.

Micro-Compiler

A Micro-Compiler in TinyML is a specialized compiler that translates high-level neural network models (e.g., from TensorFlow or ONNX) into highly optimized, low-level code for microcontroller execution. Its primary function is ahead-of-time (AOT) compilation to eliminate runtime interpretation overhead. Key features include:

Target-specific optimization: Generates C code or machine code tuned for a specific MCU core (e.g., Arm Cortex-M).
Memory planning: Statically allocates buffers for weights and activations.
Operator lowering: Converts framework-specific operations into sequences of efficient, hardware-aware kernels.

Tools like Apache TVM's MicroTVM and vendor NPU compilers are examples.

FlatBuffer Model

A FlatBuffer Model is a neural network model serialized using the FlatBuffers cross-platform serialization library. It is the standard, memory-efficient format used by TensorFlow Lite and TensorFlow Lite Micro (TFLM). Its design is critical for microcontrollers:

Zero-copy deserialization: Models can be read directly from flash memory without loading into RAM first, saving precious memory.
Schema evolution: Supports forward/backward compatibility.
Small binary size: Efficient encoding reduces storage footprint on device.

During deployment, the model FlatBuffer file is typically converted into a C array and compiled directly into the firmware binary.
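The conversion of a model file into a C array can be sketched as a short Python script, similar in spirit to what `xxd -i` produces. The function name `bytes_to_c_array`, the symbol name `g_model`, and the sample bytes are illustrative assumptions; a real project would read the bytes from the exported model file.

```python
def bytes_to_c_array(data, name="g_model", per_line=12):
    """Render raw model bytes as C source for embedding in firmware."""
    if not data:
        raise ValueError("model data is empty")
    body = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        body.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines = (
        [f"const unsigned char {name}[] = {{"]  # const lets the linker keep it in flash
        + body
        + ["};", f"const unsigned int {name}_len = {len(data)};"]
    )
    return "\n".join(lines) + "\n"


# Placeholder bytes standing in for a real model file, e.g.
# data = open("model.tflite", "rb").read()  (hypothetical path)
source = bytes_to_c_array(b"TFL3\x00\x01", name="g_model")
```

The generated text can be saved as a `.c` file and compiled with the rest of the firmware, so the model ships inside the binary and needs no filesystem.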

AI Coprocessor / microNPU

An AI Coprocessor, often called a microNPU (Neural Processing Unit), is a dedicated hardware accelerator integrated into a microcontroller or system-on-chip to offload and dramatically accelerate neural network inference. It is a key enabler for complex models on power-constrained devices.

Specialized datapaths: Designed for the matrix and convolution operations central to neural networks.
Extreme efficiency: Delivers orders of magnitude better performance-per-watt than a general-purpose CPU core.
Vendor SDK dependency: Requires a proprietary NPU SDK for model compilation and driver integration.

Examples include the Arm Ethos-U55 and accelerators in Espressif ESP32 and STM32 families.

Tensor Arena

The Tensor Arena is a statically or dynamically allocated block of memory (typically SRAM) used by a TinyML inference engine as a scratchpad for temporary data during model execution. Efficient management is paramount for MCUs with only tens of kilobytes of RAM.

Stores intermediate activations: Holds the output tensors from each layer as the graph executes.
Memory planning: Advanced frameworks perform static memory planning to minimize the arena's total size by reusing memory buffers for tensors with non-overlapping lifetimes.
Performance critical: Located in fast SRAM, its size and management strategy directly impact which models can run and their latency.
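The buffer-reuse idea behind static memory planning can be sketched with a simple greedy planner. This is an illustrative assumption about one possible strategy, not the algorithm of any specific framework: tensors are placed largest first, and each one takes the lowest offset that does not collide with a tensor whose lifetime overlaps its own. Alignment is ignored for brevity, and the tensor sizes are made up.

```python
def plan_arena(tensors):
    """Assign arena offsets so tensors with disjoint lifetimes can share bytes.

    tensors: list of (name, size_bytes, first_op, last_op)
    returns: (offsets by name, total arena size in bytes)
    """
    placed = []  # (offset, size, first_op, last_op)
    offsets = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only tensors that are live at the same time can conflict.
        conflicts = sorted(
            (o, s) for o, s, f, l in placed if not (last < f or l < first)
        )
        offset = 0
        for o, s in conflicts:
            if offset + size <= o:
                break  # fits in the gap before this block
            offset = max(offset, o + s)
        placed.append((offset, size, first, last))
        offsets[name] = offset
    arena_size = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, arena_size


# Hypothetical three-operator chain: op0 reads input, op2 writes output.
TENSORS = [
    ("input", 400, 0, 0),
    ("act1", 800, 0, 1),
    ("act2", 400, 1, 2),
    ("output", 40, 2, 2),
]
offsets, arena_size = plan_arena(TENSORS)
naive_size = sum(size for _, size, _, _ in TENSORS)
```

For this chain the planner needs 1200 bytes instead of the 1640 bytes a one-buffer-per-tensor layout would use, because `act2` reuses the bytes that held `input`.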

Deployment Workflow

The TinyML Deployment Workflow is the end-to-end process of converting a trained model into a production application on a microcontroller. It is a multi-stage pipeline distinct from cloud ML deployment. Key stages include:

Model Export & Conversion: Export to a portable format (e.g., TensorFlow Lite) and potentially convert for a specific framework.
Optimization & Quantization: Apply post-training quantization and pruning to reduce model size.
Compilation & Code Generation: Use a micro-compiler to generate optimized C code for the target MCU.
Firmware Integration: Link the model as a C array, integrate the inference engine, and write application logic.
Profiling & Validation: Use benchmarks like MLPerf Tiny to verify performance, accuracy, and memory usage on real hardware.

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
