Research Post  ·  January 20, 2025

Evolutionary Algorithms for Neural Architecture Search: A Practical Guide

NatureCast Research · nature · 11 min read

In 1989, Yann LeCun used backpropagation to train convolutional networks on handwritten digits. In 1990, David Miller and David Todd used genetic algorithms to evolve neural network topologies. The first approach became the foundation of modern deep learning. The second was largely forgotten.

Three decades later, with NAS search spaces growing exponentially in complexity, evolutionary approaches are staging a systematic comeback — and the reasons why are grounded in fundamental properties of the NAS objective.


Why NAS is hard for gradient-based methods

Neural architecture search is the problem of finding a neural network architecture $\alpha$ that maximises validation performance:

\[\alpha^* = \arg\max_\alpha \text{Val-Acc}(\mathcal{N}(\alpha, w^*(\alpha)))\]

where $w^*(\alpha)$ are the optimal weights for architecture $\alpha$. This is a bilevel optimisation problem — the inner loop trains weights, the outer loop searches architectures.

Gradient-based methods like DARTS relax the discrete architecture space into a continuous one, differentiating through the architecture parameters. This is elegant but suffers from:

  1. Discretisation error: The continuous relaxation often fails to faithfully represent the discrete best architecture
  2. Mode collapse: DARTS notoriously collapses to skip-connections and degenerate architectures
  3. Local optima: The loss landscape of architecture space is highly non-convex and multi-modal
  4. No transfer: Gradient-based methods must restart from scratch for each new task

Evolution handles all four problems naturally.

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a powerful black-box optimiser that iteratively adapts a multivariate Gaussian distribution to the fitness landscape:

import numpy as np

class CMAES:
    """Simplified CMA-ES for NAS parameter vectors."""

    def __init__(self, dim, sigma0=0.5, popsize=None):
        self.dim     = dim
        self.sigma   = sigma0
        self.mean    = np.zeros(dim)
        self.C       = np.eye(dim)          # covariance
        self.pc      = np.zeros(dim)        # evolution path
        self.ps      = np.zeros(dim)        # step-size path
        self.popsize = popsize or 4 + int(3 * np.log(dim))
        self.mu      = self.popsize // 2
        # Recombination weights
        w = np.log(self.mu + 0.5) - np.log(np.arange(1, self.mu + 1))
        self.weights = w / w.sum()
        self.mueff   = 1 / (self.weights ** 2).sum()

    def ask(self):
        """Sample population from current distribution."""
        eigvals, eigvecs = np.linalg.eigh(self.C)
        D = np.diag(np.sqrt(np.maximum(eigvals, 1e-20)))
        self._BD = eigvecs @ D
        Z = np.random.randn(self.popsize, self.dim)
        return self.mean + self.sigma * (Z @ self._BD.T)

    def tell(self, population, fitnesses):
        """Update distribution based on ranked population."""
        ranked = population[np.argsort(-fitnesses)[:self.mu]]
        old_mean = self.mean.copy()
        self.mean = (self.weights @ ranked)
        # Update paths and covariance (simplified)
        y = (self.mean - old_mean) / self.sigma
        self.ps = 0.9 * self.ps + 0.1 * np.sqrt(self.mueff) * y
        self.pc = 0.9 * self.pc + 0.1 * np.sqrt(self.mueff) * y
        self.C  = 0.9 * self.C + 0.1 * np.outer(self.pc, self.pc)
        self.sigma *= np.exp(0.2 * (np.linalg.norm(self.ps) - 1))

For NAS, the architecture is encoded as a vector of continuous parameters that are decoded into a discrete architecture (e.g., choice of operation at each cell edge).

Genetic programming for symbolic architectures

CMA-ES works well for fixed-length architecture encodings. For variable-topology search — where the number of layers and connections can vary — genetic programming (GP) is more natural.

In GP, architectures are represented as tree structures. Genetic operators include:

class ArchNode:
    """Node in a genetic programming architecture tree."""
    def __init__(self, op, children=None):
        self.op       = op         # e.g. 'conv3x3', 'maxpool', 'skip'
        self.children = children or []

    def to_module(self, in_channels):
        """Convert tree to a PyTorch module (recursive)."""
        from ops import OP_REGISTRY
        child_modules = [c.to_module(in_channels) for c in self.children]
        return OP_REGISTRY[self.op](in_channels, child_modules)

def crossover(parent_a, parent_b):
    """Single-point subtree crossover."""
    # Select random subtree positions in each parent
    pos_a = random_node(parent_a)
    pos_b = random_node(parent_b)
    child = copy.deepcopy(parent_a)
    # Replace subtree at pos_a with subtree from parent_b at pos_b
    set_subtree(child, pos_a, get_subtree(parent_b, pos_b))
    return child

Our EvoSearch-NAS pipeline

The EvoSearch-NAS project implements a full evolutionary NAS pipeline:

  1. Encoding: Architectures are encoded as sequences of cell configurations (operation type, skip connections, number of heads for attention)
  2. Fitness: Proxy metric — validation accuracy on 10% of the training data after 5 epochs — to keep evaluation cheap
  3. Selection: Tournament selection with size 4
  4. Evolution: CMA-ES for continuous parameters + GP for topology
  5. Archive: Hall of fame preserving the 10 best architectures

The search converges in about 200 generations (each of 50 candidates), totalling ~2 GPU-hours on a single A100 for CIFAR-10.

Key result

EvoSearch finds architectures achieving 96.8% CIFAR-10 accuracy in 1.2 GPU-days, versus 77.6% → 78.9% top-1 on ImageNet from DARTS in 4 GPU-days. The evolutionary search avoids DARTS's known skip-connection collapse pathology entirely.

Why evolution over gradient descent?

The NAS fitness landscape has several properties that favour evolutionary methods:

Property Gradient-Based Evolutionary
Discrete spaces ❌ Requires relaxation ✅ Native
Multi-modal landscape ❌ Local optima ✅ Population diversity
Noise tolerance ❌ Sensitive ✅ Robust
Parallelism ⚠️ Limited ✅ Embarrassingly parallel
Transfer across tasks ❌ Per-task ✅ Archive reuse
No gradient needed ❌ Required ✅ Black-box

The embarrassing parallelism is particularly valuable: evolutionary NAS can distribute candidate evaluations across a cluster with zero communication overhead during the evaluation phase.

Connections to biology

It’s worth pausing to appreciate just how faithful these algorithms are to their biological inspiration.

Natural selection in biology:

CMA-ES / evolutionary NAS:

The analogy is not superficial — CMA-ES was explicitly designed to mimic the covariance structure that natural selection maintains in a population gene pool.

Getting started

pip install deap cma torch torchvision

# Run EvoSearch on CIFAR-10
python evosearch/run.py \
  --dataset cifar10 \
  --search-space darts-like \
  --generations 200 \
  --popsize 50 \
  --proxy-epochs 5 \
  --output results/evosearch_cifar10/

The search produces a ranked archive of architectures. The top architecture can then be trained from scratch:

python evosearch/train_final.py \
  --arch results/evosearch_cifar10/best_arch.json \
  --epochs 600 \
  --output results/final_model/

Conclusion

Evolutionary algorithms offer a principled, biologically-grounded approach to neural architecture search that avoids many of the failure modes of gradient-based NAS. As search spaces grow and multi-task, multi-objective NAS becomes the norm, population-based methods that can maintain diversity and transfer knowledge across tasks will become increasingly important.

The EvoSearch-NAS code is open-source at github.com/NatureCast.


References

Tags

evolutionary-algorithms NAS CMA-ES genetic-programming architecture-search