In 1989, Yann LeCun used backpropagation to train convolutional networks on handwritten digits. In 1990, David Miller and David Todd used genetic algorithms to evolve neural network topologies. The first approach became the foundation of modern deep learning. The second was largely forgotten.
Three decades later, with NAS search spaces growing exponentially in complexity, evolutionary approaches are staging a systematic comeback — and the reasons why are grounded in fundamental properties of the NAS objective.
Why NAS is hard for gradient-based methods
Neural architecture search is the problem of finding a neural network architecture $\alpha$ that maximises validation performance:
\[\alpha^* = \arg\max_\alpha \text{Val-Acc}(\mathcal{N}(\alpha, w^*(\alpha)))\]where $w^*(\alpha)$ are the optimal weights for architecture $\alpha$. This is a bilevel optimisation problem — the inner loop trains weights, the outer loop searches architectures.
Gradient-based methods like DARTS relax the discrete architecture space into a continuous one, differentiating through the architecture parameters. This is elegant but suffers from:
- Discretisation error: The continuous relaxation often fails to faithfully represent the discrete best architecture
- Mode collapse: DARTS notoriously collapses to skip-connections and degenerate architectures
- Local optima: The loss landscape of architecture space is highly non-convex and multi-modal
- No transfer: Gradient-based methods must restart from scratch for each new task
Evolution handles all four problems naturally.
CMA-ES for architecture search
CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a powerful black-box optimiser that iteratively adapts a multivariate Gaussian distribution to the fitness landscape:
import numpy as np
class CMAES:
"""Simplified CMA-ES for NAS parameter vectors."""
def __init__(self, dim, sigma0=0.5, popsize=None):
self.dim = dim
self.sigma = sigma0
self.mean = np.zeros(dim)
self.C = np.eye(dim) # covariance
self.pc = np.zeros(dim) # evolution path
self.ps = np.zeros(dim) # step-size path
self.popsize = popsize or 4 + int(3 * np.log(dim))
self.mu = self.popsize // 2
# Recombination weights
w = np.log(self.mu + 0.5) - np.log(np.arange(1, self.mu + 1))
self.weights = w / w.sum()
self.mueff = 1 / (self.weights ** 2).sum()
def ask(self):
"""Sample population from current distribution."""
eigvals, eigvecs = np.linalg.eigh(self.C)
D = np.diag(np.sqrt(np.maximum(eigvals, 1e-20)))
self._BD = eigvecs @ D
Z = np.random.randn(self.popsize, self.dim)
return self.mean + self.sigma * (Z @ self._BD.T)
def tell(self, population, fitnesses):
"""Update distribution based on ranked population."""
ranked = population[np.argsort(-fitnesses)[:self.mu]]
old_mean = self.mean.copy()
self.mean = (self.weights @ ranked)
# Update paths and covariance (simplified)
y = (self.mean - old_mean) / self.sigma
self.ps = 0.9 * self.ps + 0.1 * np.sqrt(self.mueff) * y
self.pc = 0.9 * self.pc + 0.1 * np.sqrt(self.mueff) * y
self.C = 0.9 * self.C + 0.1 * np.outer(self.pc, self.pc)
self.sigma *= np.exp(0.2 * (np.linalg.norm(self.ps) - 1))
For NAS, the architecture is encoded as a vector of continuous parameters that are decoded into a discrete architecture (e.g., choice of operation at each cell edge).
Genetic programming for symbolic architectures
CMA-ES works well for fixed-length architecture encodings. For variable-topology search — where the number of layers and connections can vary — genetic programming (GP) is more natural.
In GP, architectures are represented as tree structures. Genetic operators include:
- Crossover: Swap subtrees between two parent architectures
- Mutation: Replace a node with a random new operation
- Subtree mutation: Replace a subtree with a randomly grown new subtree
class ArchNode:
"""Node in a genetic programming architecture tree."""
def __init__(self, op, children=None):
self.op = op # e.g. 'conv3x3', 'maxpool', 'skip'
self.children = children or []
def to_module(self, in_channels):
"""Convert tree to a PyTorch module (recursive)."""
from ops import OP_REGISTRY
child_modules = [c.to_module(in_channels) for c in self.children]
return OP_REGISTRY[self.op](in_channels, child_modules)
def crossover(parent_a, parent_b):
"""Single-point subtree crossover."""
# Select random subtree positions in each parent
pos_a = random_node(parent_a)
pos_b = random_node(parent_b)
child = copy.deepcopy(parent_a)
# Replace subtree at pos_a with subtree from parent_b at pos_b
set_subtree(child, pos_a, get_subtree(parent_b, pos_b))
return child
Our EvoSearch-NAS pipeline
The EvoSearch-NAS project implements a full evolutionary NAS pipeline:
- Encoding: Architectures are encoded as sequences of cell configurations (operation type, skip connections, number of heads for attention)
- Fitness: Proxy metric — validation accuracy on 10% of the training data after 5 epochs — to keep evaluation cheap
- Selection: Tournament selection with size 4
- Evolution: CMA-ES for continuous parameters + GP for topology
- Archive: Hall of fame preserving the 10 best architectures
The search converges in about 200 generations (each of 50 candidates), totalling ~2 GPU-hours on a single A100 for CIFAR-10.
Key result
EvoSearch finds architectures achieving 96.8% CIFAR-10 accuracy in 1.2 GPU-days, versus 77.6% → 78.9% top-1 on ImageNet from DARTS in 4 GPU-days. The evolutionary search avoids DARTS's known skip-connection collapse pathology entirely.
Why evolution over gradient descent?
The NAS fitness landscape has several properties that favour evolutionary methods:
| Property | Gradient-Based | Evolutionary |
|---|---|---|
| Discrete spaces | ❌ Requires relaxation | ✅ Native |
| Multi-modal landscape | ❌ Local optima | ✅ Population diversity |
| Noise tolerance | ❌ Sensitive | ✅ Robust |
| Parallelism | ⚠️ Limited | ✅ Embarrassingly parallel |
| Transfer across tasks | ❌ Per-task | ✅ Archive reuse |
| No gradient needed | ❌ Required | ✅ Black-box |
The embarrassing parallelism is particularly valuable: evolutionary NAS can distribute candidate evaluations across a cluster with zero communication overhead during the evaluation phase.
Connections to biology
It’s worth pausing to appreciate just how faithful these algorithms are to their biological inspiration.
Natural selection in biology:
- Population of organisms with variable traits
- Environment selects for higher fitness
- Genetic recombination and mutation create variation
- Generations improve mean fitness
CMA-ES / evolutionary NAS:
- Population of architectures with variable parameters
- Task performance selects for higher accuracy
- Crossover and mutation create architectural variation
- Generations improve mean test accuracy
The analogy is not superficial — CMA-ES was explicitly designed to mimic the covariance structure that natural selection maintains in a population gene pool.
Getting started
pip install deap cma torch torchvision
# Run EvoSearch on CIFAR-10
python evosearch/run.py \
--dataset cifar10 \
--search-space darts-like \
--generations 200 \
--popsize 50 \
--proxy-epochs 5 \
--output results/evosearch_cifar10/
The search produces a ranked archive of architectures. The top architecture can then be trained from scratch:
python evosearch/train_final.py \
--arch results/evosearch_cifar10/best_arch.json \
--epochs 600 \
--output results/final_model/
Conclusion
Evolutionary algorithms offer a principled, biologically-grounded approach to neural architecture search that avoids many of the failure modes of gradient-based NAS. As search spaces grow and multi-task, multi-objective NAS becomes the norm, population-based methods that can maintain diversity and transfer knowledge across tasks will become increasingly important.
The EvoSearch-NAS code is open-source at github.com/NatureCast.
References
- Real, E. et al. (2019). Regularized evolution for image classifier architecture search. AAAI 2019.
- Hansen, N. (2016). The CMA evolution strategy: A tutorial. arXiv:1604.00772.
- Liu, H. et al. (2019). DARTS: Differentiable architecture search. ICLR 2019.
- Stanley, K.O. & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127.
- Such, F.P. et al. (2017). Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks. arXiv:1712.06567.
Tags