Inferensys

Neural Architecture Search (NAS)

Neural Architecture Search (NAS) is an automated machine learning method that discovers optimal neural network architectures by exploring a vast design space, balancing accuracy with constraints like latency, size, and power consumption.
AUTOMATED MODEL DESIGN

What is Neural Architecture Search (NAS)?

Neural Architecture Search (NAS) is a subfield of automated machine learning (AutoML) focused on algorithmically discovering optimal neural network architectures.

NAS treats architecture design as a search problem over a predefined space of possible layer types, connections, and hyperparameters. The goal is to find a model that maximizes a given objective, typically balancing task accuracy against deployment constraints like latency, model size, or power consumption. The search is driven by a search strategy, such as a reinforcement learning controller, an evolutionary algorithm, or gradient-based optimization.

For Tiny Machine Learning (TinyML) deployment, hardware-aware NAS is critical. Here, the search objective directly incorporates metrics from the target microcontroller, such as SRAM usage or inference cycles. This enables approaches such as Once-For-All (OFA) networks, where a single trained supernet yields many specialized submodels. NAS thus moves beyond manual design, producing models optimized from the outset for severe edge constraints.

NEURAL ARCHITECTURE SEARCH

Core Components of a NAS System

Neural Architecture Search automates the design of optimal neural networks by exploring a vast space of possible architectures. Its core components define the search strategy, the space of possible models, and the evaluation method.

01

Search Space

The search space defines the universe of all possible neural network architectures the NAS algorithm can consider. For TinyML, this space is heavily constrained by hardware targets.

  • Macro-architectural choices: Number of layers, types of layers (convolution, depthwise convolution, fully connected), and connection patterns (e.g., residual blocks, inverted residuals).
  • Micro-architectural parameters: Kernel size (3x3, 5x5), number of filters per layer, expansion ratios for mobile networks, and activation functions.
  • Hardware-aware constraints: The space is often limited to operations and topologies known to be efficient on microcontrollers, such as avoiding large fully connected layers and favoring depthwise separable convolutions.
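
The choices listed above can be written down as a small, explicit search space. The sketch below is illustrative only: the dictionary keys, the allowed values, and the helper names `space_size` and `sample_architecture` are assumptions chosen for this example, not part of any particular NAS framework.

```python
import random

# Hypothetical search space: each key is a design choice, each list its allowed values.
SEARCH_SPACE = {
    "num_blocks": [2, 3, 4],
    "block_type": ["depthwise_separable", "inverted_residual"],
    "kernel_size": [3, 5],
    "filters": [8, 16, 32],
    "expansion_ratio": [1, 2, 4],
    "activation": ["relu", "relu6"],
}


def space_size(space):
    """Count the distinct architectures when each choice is made exactly once."""
    size = 1
    for options in space.values():
        size *= len(options)
    return size


def sample_architecture(space, rng):
    """Draw one architecture by picking a value for every design choice."""
    return {name: rng.choice(options) for name, options in space.items()}


rng = random.Random(0)  # fixed seed so the sample is reproducible
arch = sample_architecture(SEARCH_SPACE, rng)
print(space_size(SEARCH_SPACE), arch)
```

Even this toy space contains a few hundred combinations; real search spaces apply such choices per layer or per block, which is why exhaustive evaluation is not practical.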

02

Search Strategy

The search strategy is the algorithm that navigates the search space to discover high-performing architectures. It balances exploration of new designs with exploitation of promising regions.

  • Reinforcement Learning (RL): Uses an RNN controller to generate architecture descriptions, which are then trained and rewarded based on validation accuracy.
  • Evolutionary Algorithms: Applies genetic operations (mutation, crossover) to a population of architectures, selecting the fittest for the next generation.
  • Gradient-Based Methods: Treats the architecture selection as a continuous optimization problem (e.g., using Differentiable Architecture Search (DARTS)), allowing efficient search via gradient descent.
  • Bayesian Optimization: Models the relationship between architecture and performance as a probabilistic surrogate function to guide the search.
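
As one concrete instance of the strategies above, the following is a minimal sketch of an evolutionary search that uses mutation and truncation selection only (no crossover). The `CHOICES` space and the `toy_fitness` function are assumptions made for illustration; in a real search, fitness would come from training and evaluating each candidate.

```python
import random

# Hypothetical design choices for this example.
CHOICES = {"kernel_size": [3, 5], "filters": [8, 16, 32], "depth": [2, 3, 4]}


def toy_fitness(arch):
    """Stand-in for validation accuracy; NOT a real measurement."""
    return arch["filters"] * arch["depth"] - 4 * arch["kernel_size"]


def mutate(arch, rng):
    """Copy the parent and resample one randomly chosen design choice."""
    child = dict(arch)
    gene = rng.choice(sorted(CHOICES))
    child[gene] = rng.choice(CHOICES[gene])
    return child


def evolve(generations=20, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [
        {name: rng.choice(options) for name, options in CHOICES.items()}
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitism).
        population.sort(key=toy_fitness, reverse=True)
        parents = population[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors.
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=toy_fitness)


best = evolve()
print(best, toy_fitness(best))
```

Because the fitter half always survives, the best fitness in the population can never decrease from one generation to the next.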

03

Performance Estimation Strategy

The performance estimation strategy is the method for evaluating a candidate architecture's quality (e.g., accuracy, latency) without the prohibitive cost of fully training every candidate from scratch.

  • Low-Fidelity Estimation: Training candidates for fewer epochs or on a smaller dataset.
  • Weight Sharing / One-Shot Models: Training a single, over-parameterized supernet (like a Once-For-All network) that shares weights across all candidate subnetworks. Performance is estimated by evaluating different subgraphs of this supernet.
  • Predictor-Based: Training a surrogate model (e.g., a neural network or regression model) to predict the final performance of an architecture based on its metadata or a learned embedding.
  • Hardware-in-the-Loop Profiling: Directly measuring true latency, memory usage, and power consumption by deploying and profiling the candidate on the target microcontroller or a cycle-accurate simulator.
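
Low-fidelity estimation is often used as a filter: score every candidate cheaply, then spend the full evaluation budget only on the survivors. The sketch below shows that two-stage pattern under stated assumptions; `short_training_score` and `full_training_score` are made-up stand-ins for real training runs, and the candidates are simply channel widths.

```python
def two_stage_select(candidates, cheap_score, expensive_score, keep):
    """Rank all candidates with a cheap proxy, then fully evaluate only the top `keep`."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    scored = [(expensive_score(c), c) for c in shortlist]  # one expensive call per survivor
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score, len(scored)


# Toy stand-ins (assumptions): candidates are channel widths.
widths = [8, 16, 24, 32, 40]


def short_training_score(width):
    # Pretend result of a few epochs of training: coarse and biased toward wide models.
    return width // 8


def full_training_score(width):
    # Pretend result of full training: peaks at width 24.
    return 1.0 - ((width - 24) / 40) ** 2


best, score, expensive_calls = two_stage_select(
    widths, short_training_score, full_training_score, keep=3
)
print(best, score, expensive_calls)
```

The proxy does not need to rank candidates perfectly; it only needs to keep the true best candidate inside the shortlist. If it fails at that, the cheap stage silently discards the best design, which is the main risk of low-fidelity estimation.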

04

Hardware-Aware Objectives

For TinyML, the NAS objective function extends beyond validation accuracy to include critical hardware deployment metrics. The algorithm searches for architectures that optimize a multi-objective cost function.

  • Primary Objective: Model accuracy on the target task.
  • Hardware Constraints: Peak RAM usage, flash storage footprint, inference latency (in milliseconds), and energy consumption (in millijoules per inference).
  • Multi-Objective Optimization: Often formulated as finding architectures on the Pareto frontier, the set of designs where no single metric (e.g., accuracy) can be improved without degrading another (e.g., latency).
  • Example Objective: Maximize(Accuracy) subject to: Latency < 50ms, PeakRAM < 256KB.
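
The Pareto frontier and the constrained example objective above can both be expressed in a few lines. This is a minimal sketch: the candidate records and their numbers are invented for illustration, and the thresholds mirror the example objective (latency < 50 ms, peak RAM < 256 KB).

```python
def dominates(a, b):
    """a dominates b if it is no worse on both metrics and strictly better on one."""
    no_worse = a["accuracy"] >= b["accuracy"] and a["latency_ms"] <= b["latency_ms"]
    better = a["accuracy"] > b["accuracy"] or a["latency_ms"] < b["latency_ms"]
    return no_worse and better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def best_under_constraints(candidates, max_latency_ms=50, max_ram_kb=256):
    """Maximize accuracy among candidates that satisfy both hard limits."""
    feasible = [
        c for c in candidates
        if c["latency_ms"] < max_latency_ms and c["peak_ram_kb"] < max_ram_kb
    ]
    return max(feasible, key=lambda c: c["accuracy"]) if feasible else None


# Made-up measurements for four hypothetical architectures.
candidates = [
    {"name": "A", "accuracy": 0.90, "latency_ms": 30, "peak_ram_kb": 120},
    {"name": "B", "accuracy": 0.93, "latency_ms": 45, "peak_ram_kb": 200},
    {"name": "C", "accuracy": 0.95, "latency_ms": 70, "peak_ram_kb": 240},
    {"name": "D", "accuracy": 0.89, "latency_ms": 40, "peak_ram_kb": 300},
]

front = [c["name"] for c in pareto_front(candidates)]
choice = best_under_constraints(candidates)
print(front, choice["name"])
```

In this toy data, D is dominated by A (less accurate and slower), C is on the frontier but violates the latency limit, and B is the most accurate model that satisfies both constraints.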

AUTOMATED ARCHITECTURE DESIGN

How Does Neural Architecture Search Work?

Neural Architecture Search (NAS) automates the design of neural network topologies by treating architecture selection as an optimization problem.

NAS discovers high-performing neural network architectures for a specific task and set of constraints. It formulates architecture design as an optimization problem within a predefined search space of possible layer types, connections, and hyperparameters. A search strategy, such as reinforcement learning, evolutionary algorithms, or gradient-based methods, explores this space. A performance estimation strategy, such as training on a subset of the data or evaluating a smaller proxy network, scores candidate architectures and steers the search toward better designs.

The process is computationally intensive: searches that train every candidate separately, as early reinforcement learning and evolutionary approaches did, can require thousands of GPU hours or more. Modern work focuses on hardware-aware NAS, where the search objective includes metrics like on-device latency, memory footprint, and power consumption for target hardware such as microcontrollers. Techniques like weight sharing in a supernet (e.g., a Once-For-All network) dramatically reduce search cost by letting subnetworks share weights, so many architectures can be evaluated from a single trained model.
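
The interplay between the search strategy and the performance estimator can be summarized as a single loop. The sketch below uses random sampling as the strategy and a toy scoring function as the estimator; `run_nas`, `sample_fn`, and `estimate_fn` are hypothetical names for this example, and any of the strategies or estimators described above could be plugged in instead.

```python
import random


def run_nas(sample_fn, estimate_fn, budget, seed=0):
    """Generic search loop: propose a candidate, score it, remember the best."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    history = []
    for _ in range(budget):
        arch = sample_fn(rng)      # search strategy proposes a candidate
        score = estimate_fn(arch)  # performance estimation scores it
        history.append((arch, score))
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score, history


def sample_fn(rng):
    # Random search over a tiny, made-up space.
    return {"depth": rng.choice([2, 3, 4]), "width": rng.choice([8, 16, 32])}


def estimate_fn(arch):
    # Stand-in for an accuracy estimate; NOT a real measurement.
    return arch["depth"] * arch["width"]


best_arch, best_score, history = run_nas(sample_fn, estimate_fn, budget=10)
print(best_arch, best_score)
```

The `budget` argument makes the cost explicit: every iteration pays for one call to the estimator, so cheaper estimators translate directly into more candidates explored.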

SEARCH ALGORITHMS

Comparing NAS Search Strategies

A comparison of the primary algorithmic approaches used to explore the neural architecture search space, highlighting trade-offs in computational cost, search efficiency, and hardware awareness.

| Search Strategy | Core Mechanism | Computational Cost | Search Efficiency | Hardware-Awareness | Typical Use Case |
| --- | --- | --- | --- | --- | --- |
| Reinforcement Learning (RL) | Uses an RNN controller trained with policy gradients to propose architectures, rewarding high accuracy. | Very High | Low to Moderate | Possible via reward shaping | Early NAS research; exploring novel, unconstrained spaces. |
| Evolutionary Algorithms (EA) | Applies genetic operations (mutation, crossover) to a population of architectures, selecting based on fitness (accuracy). | Very High | Moderate | Possible via fitness function | When a diverse population of high-performing architectures is desired. |
| Bayesian Optimization (BO) | Builds a probabilistic surrogate model (e.g., Gaussian Process) to predict architecture performance and guide sampling. | High (for model building) | High | Difficult to integrate | When the search space is small but evaluating a candidate is very expensive. |
| Gradient-Based (e.g., DARTS) | Relaxes the discrete search space to be continuous, enabling architecture parameters to be optimized via gradient descent alongside network weights. | Moderate | Very High | Possible via a differentiable latency or size term in the loss | Rapid prototyping and research; requires significant memory during search. |
| One-Shot / Weight-Sharing | Trains a single over-parameterized supernet once; candidate architectures are evaluated as subnets sharing the supernet's weights. | Low per candidate (after one-time supernet training) | Very High | Native via supernet design | Production NAS for constrained devices (TinyML); efficient search on large spaces. |
| Random Search | Samples architectures uniformly at random from the search space and evaluates them independently. | Linear in the number of samples | Low | None | Strong baseline; useful for small spaces or when other methods are infeasible. |
| Sequential Model-Based Optimization (SMBO) | Iteratively fits a model to observed (architecture, performance) pairs to propose the next promising candidate. | Moderate to High | High | Possible via acquisition function | Balancing efficiency and performance in medium-sized search spaces. |

HARDWARE-AWARE OPTIMIZATION

NAS Applications in TinyML & Edge AI

Neural Architecture Search (NAS) automates the design of neural networks specifically optimized for the severe constraints of microcontrollers and edge devices, balancing accuracy with metrics like latency, memory footprint, and power consumption.

01

Hardware-Aware Search Objectives

Unlike traditional NAS focused solely on accuracy, TinyML NAS incorporates hardware-specific metrics as primary search objectives. The search algorithm is guided by a multi-objective reward function that penalizes architectures exceeding target constraints.

Key search targets include:

  • Peak RAM/Flash Usage: Models must fit within the microcontroller's limited SRAM (often < 512KB) and flash memory (often < 1MB).
  • Inference Latency: Measured in milliseconds, targeting real-time sensor processing (e.g., < 100ms for keyword spotting).
  • Energy Consumption: Estimated in millijoules per inference, critical for battery-powered devices.
  • Multiply-Accumulate (MAC) Operations: A hardware-independent proxy for compute cost that correlates with latency and energy, though less reliably than on-device measurement.

Search spaces are constrained from the start, excluding operations that are prohibitive on MCUs, such as large dense layers or wide standard convolutions.
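
One published way to fold latency into a single scalar reward is the soft-constraint form used by MnasNet, which multiplies accuracy by (latency / target) raised to a negative exponent. The sketch below follows that form and adds a hard RAM check on top; the function names, the RAM check, and the example numbers are assumptions for illustration, and the exponent of -0.07 is the commonly cited default rather than a tuned value.

```python
def latency_aware_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Soft constraint: scale accuracy by (latency / target) ** w, with w < 0."""
    return accuracy * (latency_ms / target_ms) ** w


def constrained_reward(accuracy, latency_ms, target_ms, peak_ram_kb, ram_budget_kb):
    """Hard RAM constraint layered on top of the soft latency term."""
    if peak_ram_kb > ram_budget_kb:
        return 0.0  # the model does not fit, so it is never preferred
    return latency_aware_reward(accuracy, latency_ms, target_ms)


on_target = latency_aware_reward(0.90, latency_ms=50, target_ms=50)
too_slow = latency_aware_reward(0.90, latency_ms=100, target_ms=50)
too_big = constrained_reward(0.95, 40, 50, peak_ram_kb=300, ram_budget_kb=256)
print(on_target, too_slow, too_big)
```

A soft latency term lets the search trade a little latency for accuracy near the target, while the hard RAM check reflects the fact that a model that does not fit in SRAM cannot run at all.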

02

Search Space Design for MCUs

The set of possible operations and connections (the search space) is meticulously crafted for microcontroller efficiency. It heavily favors depthwise separable convolutions, which drastically reduce parameters and MACs compared to standard convolutions. Common building blocks include:

  • MobileNet-style inverted residuals: Efficient blocks with expansion and projection layers.
  • Squeeze-and-Excitation (SE) attention: Lightweight channel-wise gating to boost accuracy with minimal cost.
  • Grouped & pointwise convolutions: To further reduce computational complexity.
  • Activation functions: Bounded choices like ReLU6 (ReLU clipped at 6) are preferred because a fixed output range simplifies quantization.
  • Skip connections: To facilitate gradient flow in very deep, efficient networks.

The search space explicitly excludes operations with high memory or compute overhead, such as large kernel sizes. Batch normalization adds no inference cost, because it is folded into the preceding layer's weights before deployment.
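
The saving from depthwise separable convolutions follows directly from counting multiply-accumulates. The sketch below assumes stride 1 and "same" padding, so the output feature map has the same height and width as the input; the layer dimensions are made up for the example.

```python
def standard_conv_macs(height, width, c_in, c_out, kernel):
    """MACs for a standard convolution: every output channel sees every input channel."""
    return height * width * c_in * c_out * kernel * kernel


def depthwise_separable_macs(height, width, c_in, c_out, kernel):
    """MACs for a depthwise (per-channel k x k) plus pointwise (1 x 1) convolution."""
    depthwise = height * width * c_in * kernel * kernel
    pointwise = height * width * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 32x32 feature map, 16 -> 32 channels, 3x3 kernel.
std = standard_conv_macs(32, 32, 16, 32, 3)
sep = depthwise_separable_macs(32, 32, 16, 32, 3)
print(std, sep, sep / std)
```

The ratio simplifies to 1/c_out + 1/kernel², so for this layer the separable version needs roughly 14% of the MACs of the standard convolution.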

03

Once-For-All & Supernet Paradigm

Training a unique model for every device variant is infeasible. The Once-For-All (OFA) approach trains a single, large supernet encompassing many possible subnetworks of different depths, widths, and kernel sizes. After this one-time training, specialized submodels for specific latency or memory targets can be extracted without retraining.

This is ideal for TinyML because:

  • Hardware Variants: A single supernet can yield models for MCUs with different clock speeds or memory sizes.
  • Changing Power Budgets: A deployment can use a smaller submodel on battery power and a larger one when plugged in, which pairs naturally with Dynamic Voltage & Frequency Scaling (DVFS).
  • Product Families: One supernet serves an entire product line, from a basic to a premium sensor.

The supernet is trained with progressive shrinking, gradually introducing smaller subnetworks into the training process to maintain the accuracy of all contained architectures.
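
The core idea of extracting a submodel without retraining is that the subnet's weights are a slice of the supernet's weights. The sketch below is a deliberate simplification: it keeps the first layers and the leading channels of a toy supernet stored as nested lists. The actual OFA method is more involved (for example, it ranks channels by importance before shrinking width), and `make_supernet`, `extract_subnet`, and `count_weights` are hypothetical helpers, not the OFA API.

```python
def make_supernet(depth, width):
    """Toy supernet: `depth` square layers of size width x width (placeholder values)."""
    return [
        [[float(layer * 100 + out * 10 + inp) for inp in range(width)]
         for out in range(width)]
        for layer in range(depth)
    ]


def extract_subnet(supernet, depth, width):
    """Keep the first `depth` layers and the leading width x width slice of each."""
    return [[row[:width] for row in layer[:width]] for layer in supernet[:depth]]


def count_weights(net):
    return sum(len(row) for layer in net for row in layer)


supernet = make_supernet(depth=4, width=8)
small = extract_subnet(supernet, depth=2, width=4)
print(count_weights(supernet), count_weights(small))
```

Every weight in `small` is copied unchanged from `supernet`, which is why the supernet has to be trained so that such slices remain accurate on their own.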

04

Differentiable NAS & Gradient-Based Search

Differentiable Architecture Search (DARTS) is a popular NAS method adapted for TinyML. Instead of evaluating thousands of discrete architectures, it relaxes the search space to be continuous. An architecture parameter (alpha) is associated with each possible operation (e.g., 3x3 conv, 5x5 depthwise conv, skip connection, zero).

During the search phase, the model is a mixture of all operations, with each operation weighted by a softmax over the alpha parameters at its choice point. The search optimizes two sets of parameters, usually in alternating gradient steps:

  1. The standard network weights, updated by gradient descent on the training loss.
  2. The architecture parameters, alpha, updated by gradient descent on the validation loss.

After search, a discrete architecture is derived by retaining only the operation with the highest alpha value at each choice point. This method is significantly more compute-efficient than reinforcement learning or evolutionary-based NAS, making it more accessible for edge-focused research.
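
The continuous relaxation and the final discretization step can be illustrated on a single choice point. In this minimal sketch the candidate operations act on a scalar rather than a feature map, and the names `OPS`, `mixed_op`, and `discretize` are assumptions for the example; gradient updates of the weights and alphas are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on a scalar, standing in for conv / skip / zero.
OPS = {
    "scale_by_2": lambda x: 2.0 * x,
    "skip_connect": lambda x: x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """Search-time output: softmax(alpha)-weighted sum of every candidate operation."""
    weights = softmax([alphas[name] for name in OPS])
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))


def discretize(alphas):
    """After search: keep only the operation with the largest alpha."""
    return max(alphas, key=alphas.get)


uniform = {"scale_by_2": 0.0, "skip_connect": 0.0, "zero": 0.0}
learned = {"scale_by_2": 0.1, "skip_connect": 1.5, "zero": -2.0}
print(mixed_op(3.0, uniform), discretize(learned))
```

Because `mixed_op` evaluates every candidate operation on every forward pass, memory use during search grows with the number of candidates, which is the cost noted for gradient-based methods.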

05

Co-Search of Architecture & Quantization Policy

The most advanced TinyML NAS frameworks perform joint architecture and quantization policy search. The search algorithm doesn't just choose operations; it also decides the bit-width (precision) for weights and activations for each layer.

For example, the search might learn that:

  • The first convolutional layer benefits from 8-bit weights and activations for robustness.
  • Intermediate depthwise layers can be reduced to 4-bit with minimal accuracy loss.
  • The final classification layer requires 8-bit activations.

This results in heterogeneously quantized models where the precision varies per layer, achieving an optimal accuracy-to-efficiency trade-off. The search reward function directly incorporates the cost of mixed-precision arithmetic on the target hardware.
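
The storage effect of a per-layer bit-width policy is simple arithmetic. The sketch below compares a uniform 8-bit policy with a mixed policy like the one described above; the layer names and weight counts are invented for the example, and only weight storage is counted (activations and quantization metadata are ignored).

```python
def model_size_bytes(layers, bit_widths):
    """Total weight storage in bytes for a per-layer bit-width assignment."""
    total_bits = sum(
        layer["num_weights"] * bit_widths[layer["name"]] for layer in layers
    )
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical four-layer model.
layers = [
    {"name": "first_conv", "num_weights": 3 * 3 * 3 * 16},
    {"name": "depthwise_1", "num_weights": 3 * 3 * 16},
    {"name": "pointwise_1", "num_weights": 16 * 32},
    {"name": "classifier", "num_weights": 32 * 10},
]

uniform_8bit = {layer["name"]: 8 for layer in layers}
mixed = {"first_conv": 8, "depthwise_1": 4, "pointwise_1": 4, "classifier": 8}

print(model_size_bytes(layers, uniform_8bit), model_size_bytes(layers, mixed))
```

A joint search would treat the `mixed` dictionary as part of the candidate and score it together with the architecture, rather than choosing bit-widths after the architecture is fixed.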

06

Real-World Application: Keyword Spotting

A canonical TinyML NAS success is designing models for always-on keyword spotting (e.g., "Hey Google," "Alexa") on microcontrollers. The constraints are extreme: ~250KB RAM, ~50ms latency, and a power budget low enough for continuous battery operation.

Efficient keyword-spotting architectures, whether found by search or designed by hand like the BC-ResNet and TC-ResNet families, typically feature:

  • 1D temporal convolutions instead of 2D spectrogram processing.
  • Carefully tuned dilation rates to capture long-range audio dependencies without increasing kernel size.
  • Extremely narrow channel widths in early layers, expanding only where necessary.

NAS-generated models for this task are reported to outperform hand-designed baselines like DS-CNN, achieving >96% accuracy on the Google Speech Commands dataset while meeting the hardware constraints above. They demonstrate NAS's ability to find non-intuitive, highly efficient patterns that human designers might overlook.
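
The benefit of tuning dilation rates, mentioned in the list above, can be quantified with the receptive field of a stack of 1D convolutions. The sketch assumes stride 1 and the same kernel size in every layer; the dilation schedules are examples, not values taken from any specific model.

```python
def receptive_field(kernel_size, dilations):
    """Input frames seen by one output frame of stacked stride-1 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


plain = receptive_field(kernel_size=3, dilations=[1, 1, 1, 1])
dilated = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])
print(plain, dilated)
```

With the same four layers and the same number of weights, the dilated stack covers 31 input frames instead of 9, which is how dilation captures longer audio context without larger kernels.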

Target RAM: < 250KB
Accuracy: > 96%

NEURAL ARCHITECTURE SEARCH

Frequently Asked Questions

Neural Architecture Search (NAS) automates the design of optimal neural network structures. For TinyML, this process is constrained by the severe memory, power, and compute limits of microcontrollers.

Neural Architecture Search (NAS) is an automated machine learning (AutoML) process that discovers high-performing neural network architectures for a given task and set of constraints, rather than relying on manual human design. It works by defining a search space of possible architectural components (e.g., layer types, number of filters, connection patterns), using a search strategy (like reinforcement learning, evolutionary algorithms, or gradient-based methods) to explore this space, and a performance estimation strategy (like training on a validation set or using a proxy) to evaluate candidate architectures. The goal is to find the architecture that best balances a primary objective (like accuracy) with deployment constraints (like model size or latency).

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.