Attention is simultaneously one of the most successful ideas in modern AI and one of the most misunderstood. The scaled dot-product attention introduced in “Attention Is All You Need” (Vaswani et al., 2017) has powered nearly a decade of dramatic progress in language, vision, and multimodal learning. Yet it captures only the computational outcome of attention, not the rich biological machinery that inspired it.
In this post, we examine what neuroscience actually tells us about attention, and extract concrete design principles for building better artificial neural networks.
What is biological attention?
In cognitive neuroscience, attention refers to the selective amplification of sensory signals that are relevant to current behaviour, combined with the suppression of irrelevant signals. There are at least three distinct systems:
- Spatial attention — orienting toward a location in space (the “spotlight” model)
- Feature-based attention — amplifying specific features (e.g. colour, orientation) across the visual field
- Object-based attention — attending to whole objects rather than locations or features
These systems are implemented by a distributed network of brain areas, with the prefrontal cortex (PFC) providing top-down signals that bias competition in sensory areas.
Key finding
Biological attention is largely multiplicative: the PFC doesn't add signals to sensory areas; it multiplies the gain of relevant feature detectors. This is fundamentally different from softmax attention in transformers, where normalised weights mix value vectors into a weighted sum that is then added to the existing representation.
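To make the multiplicative/additive distinction concrete, here is a minimal NumPy sketch. It assumes a hypothetical feature detector with Gaussian orientation tuning; the names `tuning_curve`, `gain`, and `bias` and their values are illustrative choices, not parameters from any published model.

```python
import numpy as np

def tuning_curve(stimulus, preferred=0.0, width=20.0):
    """Gaussian tuning of a hypothetical orientation-selective unit."""
    return np.exp(-0.5 * ((stimulus - preferred) / width) ** 2)

orientations = np.linspace(-90.0, 90.0, 7)  # stimulus orientations in degrees
baseline = tuning_curve(orientations)

gain = 1.5  # assumed top-down gain factor
bias = 0.2  # assumed additive top-down input

modulated_mult = gain * baseline  # multiplicative: scales every response
modulated_add = baseline + bias   # additive: shifts every response

# After peak-normalisation, only the multiplicative case keeps the original shape.
reference = baseline / baseline.max()
same_shape_mult = np.allclose(modulated_mult / modulated_mult.max(), reference)
same_shape_add = np.allclose(modulated_add / modulated_add.max(), reference)
print(same_shape_mult, same_shape_add)
```

Running this prints `True False`: gain modulation preserves the unit's selectivity while amplifying it, whereas an additive input lifts the responses to non-preferred stimuli as well and so flattens the relative tuning.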
The biased competition model
The most influential computational model of biological attention is Desimone & Duncan’s biased competition framework (1995). In this model:
- Multiple stimuli compete for representation in sensory cortex
- Top-down signals from PFC bias this competition toward task-relevant stimuli
- The “winner” suppresses competing representations
This is strikingly similar to attention in transformers, with one critical difference: biological competition is winner-take-more and plays out through mutual suppression between representations, rather than being resolved by a single softmax normalisation.
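The three steps of the biased competition model can be illustrated with a toy rate model. This is a sketch under simplifying assumptions (rectified-linear units, uniform lateral inhibition, a constant top-down bias), not the specific model of Desimone & Duncan; the function name `biased_competition` and all parameter values are hypothetical.

```python
import numpy as np

def biased_competition(drive, bias, inhibition=0.5, steps=200, rate=0.2):
    """Relax a small population of mutually inhibiting units to a steady state."""
    drive = np.asarray(drive, dtype=float)
    bias = np.asarray(bias, dtype=float)
    r = np.zeros_like(drive)
    for _ in range(steps):
        lateral = inhibition * (r.sum() - r)  # suppression from all other units
        target = np.maximum(drive + bias - lateral, 0.0)
        r += rate * (target - r)              # leaky approach to the target rate
    return r

drive = np.array([1.0, 1.0])  # two equally strong stimuli
attended = biased_competition(drive, bias=np.array([0.2, 0.0]))
neutral = biased_competition(drive, bias=np.zeros(2))
print("with bias:", attended, "ratio:", attended[0] / attended[1])
print("no bias:  ", neutral)
```

With these settings the biased unit receives 1.2 times the input of its competitor but ends up with roughly 1.75 times the response, because the winner's activity further suppresses the loser. That amplification of a small top-down advantage is the winner-take-more behaviour described above.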
Working memory and the key-value metaphor
The transformer models attention as a retrieval operation over a key-value store:
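# Scaled dot-product attention for 2-D arrays Q, K, V (NumPy sketch)
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]  # query/key dimensionality
    scores = (Q @ K.T) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V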
This has a biological parallel. The hippocampus acts as a content-addressable memory: a partial query (the query vector Q) retrieves stored patterns (keys K) and returns associated values (V). But biological memory retrieval relies on pattern completion over Hebbian-learned associations, not dot products, and retrieval often modifies the memory trace (reconsolidation).
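For contrast with the dot-product lookup, the sketch below shows pattern completion in a classic Hopfield-style network, a textbook abstraction of Hebbian content-addressable memory. It is used here as a stand-in and is not a model of the hippocampus itself; the helper names `store` and `complete` are hypothetical, and the synchronous update is chosen for brevity.

```python
import numpy as np

def store(patterns):
    """Hebbian outer-product rule: sum of x x^T over patterns, zero diagonal."""
    patterns = np.asarray(patterns, dtype=float)
    W = patterns.T @ patterns / patterns.shape[1]
    np.fill_diagonal(W, 0.0)  # no self-connections
    return W

def complete(W, cue, steps=5):
    """Recover a stored +/-1 pattern from a partial cue by repeated thresholding."""
    state = np.asarray(cue, dtype=float).copy()
    for _ in range(steps):
        state = np.where(W @ state >= 0, 1.0, -1.0)
    return state

p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
p2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
W = store([p1, p2])

cue = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # second half unknown
recalled = complete(W, cue)
print(np.array_equal(recalled, p1))
```

Here the cue specifies only the first half of `p1`, and the recurrent dynamics fill in the rest. Unlike the key-value lookup, there is no separate value store: the memory is the weight matrix, and the retrieved pattern is a fixed point of the network.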
The role of theta oscillations
One of the most striking features of hippocampal memory is its dependence on theta oscillations (~4–8 Hz). During a theta cycle:
- Encoding phase: new information is written into synaptic weights
- Retrieval phase: stored patterns are retrieved and projected to cortex
This alternating encode/retrieve cycle has no equivalent in standard transformers. It suggests that temporally structured attention — where reading and writing occur at different phases — might be substantially more powerful.
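One way to picture phase-separated reading and writing is a toy associative memory that writes during the first half of each cycle and reads during the second. This is an illustrative sketch only: the class name `ThetaGatedMemory`, the hard binary gate, and the `period` parameter are assumptions made for the example, and a trainable version would presumably use a smooth, learnable gate instead.

```python
import numpy as np

class ThetaGatedMemory:
    """Toy associative memory that writes and reads on opposite half-cycles."""

    def __init__(self, dim, period=8):
        self.M = np.zeros((dim, dim))  # value-by-key association matrix
        self.period = period           # time steps per simulated cycle
        self.t = 0

    def step(self, key, value):
        phase = (self.t % self.period) / self.period
        self.t += 1
        if phase < 0.5:
            # Encoding half-cycle: Hebbian outer-product write, no readout.
            self.M += np.outer(value, key)
            return None
        # Retrieval half-cycle: read using the key as a query, no write.
        return self.M @ key

mem = ThetaGatedMemory(dim=3, period=4)
k0, k1 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
v0, v1 = np.array([0.5, -1.0, 2.0]), np.array([3.0, 0.0, 1.0])

first = mem.step(k0, v0)            # steps 0 and 1 fall in the encoding phase
second = mem.step(k1, v1)
readout = mem.step(k0, np.zeros(3)) # step 2 falls in the retrieval phase
print(readout)
```

Because the two keys are orthogonal, the readout for `k0` is exactly `v0`. The point of the gate is that a query arriving during retrieval cannot overwrite the stored association, and a write during encoding is not contaminated by what is already stored.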
Predictive coding: attention as prediction error
An increasingly influential theory — predictive coding (Rao & Ballard, 1999; Friston, 2010) — reframes perception as inference. The brain maintains a generative model of the world, and attention is directed toward prediction errors — the places where the model’s predictions fail to match incoming sensory signals.
This is conceptually similar to cross-attention in encoder-decoder transformers, where the decoder queries the encoder for the information most needed to resolve uncertainty. But predictive coding is hierarchical and bidirectional — there is no clean encoder/decoder split.
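The idea of directing attention toward prediction errors can be sketched with a single-layer linear predictive coding loop in the spirit of Rao & Ballard (1999). The generative weights `U`, the step size, and in particular the `salience` normalisation at the end are illustrative assumptions, not part of the cited models.

```python
import numpy as np

def infer(x, U, steps=100, lr=0.1):
    """Refine the latent estimate r so that the prediction U @ r explains x."""
    r = np.zeros(U.shape[1])
    for _ in range(steps):
        error = x - U @ r        # bottom-up prediction error
        r += lr * (U.T @ error)  # update the estimate to reduce that error
    return r, x - U @ r

U = np.eye(4)[:, :2]                 # hypothetical model: explains inputs 0 and 1 only
x = np.array([1.0, -0.5, 0.0, 2.0])  # input 3 carries an unpredicted signal

r, residual = infer(x, U)
salience = residual ** 2 / np.sum(residual ** 2)  # share of unexplained signal
print(np.argmax(salience))
```

After inference the model has explained away the first two inputs, so nearly all of the remaining error, and therefore the salience in this sketch, sits on the input the model could not predict.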
Design principles for bio-inspired attention
Drawing on the neuroscience, we identify five principles that current transformers largely violate:
| Principle | Biology | Standard Transformer |
|---|---|---|
| Competition | Non-linear, winner-take-more | Softmax (uniform at init) |
| Memory cycle | Theta encode/retrieve | Single forward pass |
| Spatial prior | Retinotopic organisation | No spatial bias |
| Modulatory context | PFC gain modulation | Added Q, K, V projections |
| Feedback | Rich top-down connections | Decoder cross-attention only |
Current work in NeuroSynth
Our NeuroSynth project is exploring three of these principles:
- Competitive attention — replacing softmax with a normalised ReLU competition that more closely mirrors biased competition
- Oscillatory gating — introducing a learnable temporal gate that separates encoding and retrieval
- Gain modulation — implementing context-dependent multiplicative modulation of attention weights
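As a rough illustration of the first item, the sketch below replaces the softmax with a rectified, sum-normalised competition. It is one plausible reading of “normalised ReLU competition” and is not the NeuroSynth implementation; the function name `relu_attention` and the `eps` stabiliser are assumptions made for the example.

```python
import numpy as np

def relu_attention(Q, K, V, eps=1e-9):
    """Attention with ReLU-rectified, sum-normalised weights instead of softmax."""
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)
    rectified = np.maximum(scores, 0.0)  # keys with negative scores are silenced
    weights = rectified / (rectified.sum(axis=-1, keepdims=True) + eps)
    return weights @ V, weights

Q = np.array([[1.0, 0.0]])
K = np.array([[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

out, weights = relu_attention(Q, K, V)
print(weights)
print(out)
```

Unlike softmax, which assigns every key a strictly positive weight, this variant gives losing keys exactly zero weight, so the third key above contributes nothing to the output. One design consequence worth noting: if every score for a query is negative, all of its weights are zero and the output is a zero vector.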
Preliminary results on long-range dependency tasks show that oscillatory gating improves performance by up to 4.2% on long-context language modelling while reducing memory usage by 18%.
Conclusion
Biological attention is far richer than its transformer analogue. By studying the neuroscience more carefully, we can identify principled improvements: competitive dynamics, temporal structure, and gain modulation. This is not biomimicry for its own sake — it is a systematic search for better computational primitives.
Code and benchmarks for the NeuroSynth attention variants will be released on GitHub in Q2 2025.
References
- Vaswani, A. et al. (2017). Attention is all you need. NeurIPS.
- Desimone, R. & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
- Rao, R.P.N. & Ballard, D.H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79–87.
- Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11, 127–138.
- Lisman, J. & Jensen, O. (2013). The theta-gamma neural code. Neuron, 77, 1002–1016.