Recurrent neural network

A recurrent neural network (RNN) is a class of artificial neural network in which connections between units form directed cycles, allowing the network to maintain an internal state (or "memory") that persists across time steps. This architecture makes RNNs natural candidates for processing sequential data such as text, speech, and time-series signals, where the meaning of an element depends on what came before it.

RNNs, particularly gated variants such as the LSTM, dominated sequence modelling in natural language processing and speech recognition through the mid-2010s, until the transformer architecture displaced them for most large-scale tasks in the late 2010s.^[1] They remain in widespread use for streaming, low-latency, and resource-constrained applications.

History

Early ideas (1982–1990)

The idea of networks with feedback loops dates to Hopfield's 1982 work on content-addressable memory,^[2] but the modern recurrent form was introduced independently by Michael I. Jordan in 1986^[3] and Jeffrey L. Elman in 1990.^[4] Elman networks added a context layer that copied the hidden state from the previous step and fed it back as input, creating what is now called a "simple" or "vanilla" RNN.

Vanishing gradients and LSTM (1991–1997)

In 1991, Sepp Hochreiter's diploma thesis identified the vanishing gradient problem: gradients computed by backpropagation through time decay (or occasionally explode) exponentially with the length of the input sequence, making it extremely difficult for vanilla RNNs to learn long-range dependencies.^[5]
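The exponential behaviour can be seen in a deliberately simplified setting. The sketch below assumes a scalar, linear recurrence h_t = w * h_{t-1} (no nonlinearity, no input), which is not the general case analysed in the thesis; the function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        # Each step back in time multiplies the gradient by the same factor w.
        grad *= w
    return grad


if __name__ == "__main__":
    for w in (0.9, 1.0, 1.1):
        print(f"w={w}: gradient after 100 steps = {gradient_through_time(w, 100):.3e}")
```

With |w| < 1 the gradient shrinks towards zero, with |w| > 1 it grows without bound, and only w = 1 preserves it. A tanh nonlinearity adds a further factor of at most 1 per step, which makes decay the more common failure.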

Hochreiter and Jürgen Schmidhuber addressed this in 1997 with the long short-term memory (LSTM) cell, which introduces a linear "cell state" guarded by multiplicative input, output, and (in the 2000 Gers–Schmidhuber revision) forget gates.^[6] Because the cell state is updated additively rather than by repeated matrix multiplication, gradients can flow over hundreds or thousands of time steps with far less decay than in a vanilla RNN.
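A single LSTM step can be sketched as follows. This is a minimal illustration of the commonly used formulation with a forget gate, not the exact 1997 cell; the function names, the `params` dictionary layout, and the choice to concatenate the previous hidden state with the input are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params maps a gate name to a (W, b) pair (hypothetical layout)."""
    v = h_prev + x  # list concatenation: [h_{t-1}; x_t]
    i = [sigmoid(z) for z in affine(*params["input"], v)]        # input gate
    f = [sigmoid(z) for z in affine(*params["forget"], v)]       # forget gate
    o = [sigmoid(z) for z in affine(*params["output"], v)]       # output gate
    g = [math.tanh(z) for z in affine(*params["candidate"], v)]  # candidate values
    # Additive cell-state update: old state scaled by f, plus gated new content.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is copied forward almost unchanged, which is the path along which gradients survive.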

The RNN era in NLP and speech (2013–2017)

From roughly 2013, RNNs — usually LSTMs or the simpler gated recurrent unit (GRU) introduced by Cho et al. in 2014^[7] — became the dominant architecture for:

Machine translation, via the sequence-to-sequence (seq2seq) encoder–decoder framework of Sutskever, Vinyals, and Le (2014)^[8]
Speech recognition, notably in Google's 2015 production systems^[9]
Handwriting recognition, image captioning, and music generation

The attention mechanism of Bahdanau, Cho, and Bengio (2014) was originally introduced as an augmentation to RNN encoder–decoders, allowing the decoder to look back at any encoder hidden state rather than compressing the entire input into a single vector.^[10]
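The idea of looking back at every encoder state can be sketched in a few lines. Note the assumption: the example scores each encoder state with a plain dot product, whereas the original paper uses a small learned network for scoring; the function name `attend` is hypothetical.

```python
import math


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Assumption: dot-product scoring stands in for the learned scoring network.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Numerically stable softmax over the encoder positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted average of all encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context
```

The decoder receives a different context vector at every step, so no single fixed-size vector has to summarise the whole input.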

Displacement by transformers (2017–present)

The transformer architecture, introduced by Vaswani et al. in 2017, removed recurrence entirely in favour of pure attention.^[1] Because transformer training can be parallelised over the sequence length, whereas an RNN must compute its hidden states one step at a time, transformers scale to much larger models and datasets. By 2020 they had displaced RNNs as the dominant architecture for most NLP tasks, and by around 2022 they had become standard in speech and vision as well.

Interest in recurrent architectures revived somewhat after 2023 with state-space models such as Mamba^[11] and related recurrent designs such as RWKV^[12] and xLSTM,^[13] which combine RNN-like constant per-token inference cost with parallelisable training.

Architecture

Elman network

The simplest RNN takes input vector <math>x_t</math> at each time step <math>t</math> and maintains a hidden state <math>h_t</math>:

<math>h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b_h)</math>

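where <math>\sigma</math> is a nonlinearity such as <math>\tanh</math>. An output <math>y_t = W_{hy} h_t + b_y</math> is read from the hidden state, usually followed by a task-specific activation such as softmax. The weight matrices <math>W_{hh}, W_{xh}, W_{hy}</math> are shared across all time steps.^[4] The feedback of <math>h_{t-1}</math> into the update is what makes the network recurrent, and parameter sharing allows it to process sequences of arbitrary length with a fixed parameter count.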
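The hidden-state update above can be sketched directly. The example is a minimal illustration, assuming tanh as the nonlinearity and plain Python lists for vectors and matrices; the helper names `matvec`, `elman_step`, and `run_sequence` are hypothetical.

```python
import math


def matvec(W, v):
    """Return W @ v, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def elman_step(x, h_prev, W_xh, W_hh, b_h):
    """Compute h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    pre = [r + i + b for r, i, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x), b_h)]
    return [math.tanh(p) for p in pre]


def run_sequence(xs, h0, W_xh, W_hh, b_h):
    """Apply the same weights at every step and return all hidden states."""
    h, states = h0, []
    for x in xs:
        h = elman_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```

Because `run_sequence` reuses the same three parameters at every step, it accepts a sequence of any length without any change to the parameter count.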

Variants

Jordan network — recurrence runs from output back to input rather than hidden-to-hidden.
Bidirectional RNN (Schuster & Paliwal, 1997)^[14] — two independent RNNs run left-to-right and right-to-left; their hidden states are concatenated, giving each position context from both directions.
Deep RNN — multiple recurrent layers are stacked, with the hidden state of each layer fed as input to the next.
LSTM — replaces the simple hidden-state update with gated cell-state arithmetic; the default choice when vanilla RNNs fail to learn.
Gated recurrent unit (GRU) — a simplified LSTM with only two gates (reset and update) and no separate cell state; often matches LSTM quality with roughly 25% fewer parameters.^[7]
Echo state network and liquid state machine — reservoir-computing variants in which only the readout layer is trained.
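The "roughly 25% fewer parameters" figure for the GRU follows from counting weight blocks: an LSTM layer has four (three gates plus the candidate), a GRU layer has three (two gates plus the candidate). The sketch below assumes each block has one input-to-hidden matrix, one hidden-to-hidden matrix, and a single bias vector; some libraries store two bias vectors per block, which does not change the ratio. The function name and the example sizes are hypothetical.

```python
def gated_rnn_params(input_size: int, hidden_size: int, num_blocks: int) -> int:
    """Count the parameters of one gated recurrent layer.

    Assumption: each block has weights of shape hidden x (input + hidden)
    plus one bias vector of length hidden.
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block


lstm_params = gated_rnn_params(input_size=128, hidden_size=256, num_blocks=4)
gru_params = gated_rnn_params(input_size=128, hidden_size=256, num_blocks=3)
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions the ratio is 3/4 for any choice of input and hidden size.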

Training

RNNs are trained with backpropagation through time (BPTT): the recurrent computation is "unrolled" into a feed-forward graph of length <math>T</math>, and standard backpropagation is applied.^[15] In practice BPTT is usually truncated to a window of 50–200 steps to bound memory use.
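The unrolling can be illustrated on a scalar RNN. The sketch below is a toy under stated assumptions: a single hidden unit h_t = tanh(w * h_{t-1} + u * x_t), a squared-error loss on the final state only, and an optional `window` argument that truncates the backward pass. All names are hypothetical.

```python
import math


def forward(w, u, xs, h0=0.0):
    """Unroll the recurrence and return [h_0, h_1, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2


def bptt_grad_w(w, u, xs, target, window=None):
    """Gradient of the loss with respect to w, walking backwards through time."""
    hs = forward(w, u, xs)
    steps = len(xs)
    stop = 0 if window is None else max(steps - window, 0)
    dh = hs[-1] - target  # dL/dh_T
    grad_w = 0.0
    for t in range(steps, stop, -1):
        da = dh * (1.0 - hs[t] ** 2)  # back through tanh
        grad_w += da * hs[t - 1]      # this step's direct contribution to dL/dw
        dh = da * w                   # pass the gradient on to h_{t-1}
    return grad_w
```

Passing a small `window` drops the contributions of the earliest steps, trading gradient accuracy for bounded work and memory.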

The exploding gradient counterpart of the vanishing problem is typically addressed by gradient clipping — rescaling any gradient whose norm exceeds a threshold — as proposed by Pascanu, Mikolov, and Bengio in 2013.^[16]
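Norm-based clipping can be sketched as follows, assuming the gradient has been flattened into a single list of floats and that the norm is taken over all components together; the function name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Rescaling preserves the direction of the gradient and only limits the step size, which is why it is usually preferred over clipping each component independently.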

Applications

Despite transformer dominance in large-model work, RNNs remain competitive or preferred where:

Streaming inference is required. An RNN emits output token-by-token at constant cost per step; an attention-based model's cost grows with context length, making long-context streaming expensive. This keeps LSTMs prevalent in on-device automatic speech recognition and real-time captioning.
Parameters are scarce. Small RNNs (<10 MB) can match or outperform similarly sized transformers on some time-series forecasting, sensor-fusion, and edge-deployment tasks.
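Sequence length is very long or unbounded. Linear recurrent models (Mamba, RWKV) carry a fixed-size state, so inference memory is O(1) in the sequence length, whereas standard attention costs O(n²) time over a sequence of length n and its key-value cache grows as O(n).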
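The memory contrast in the points above can be illustrated with two toy stream processors. Both are stand-ins chosen for brevity and are assumptions of this sketch: the recurrent one is a scalar exponential moving average, not Mamba or RWKV, and the cached one averages all past tokens uniformly rather than computing real attention weights.

```python
def stream_recurrent(tokens, decay=0.9):
    """Fixed-size state: one float, however many tokens have been seen."""
    state = 0.0
    for x in tokens:
        state = decay * state + (1.0 - decay) * x
        yield state


def stream_with_cache(tokens):
    """Growing state: every past token is kept and revisited at each step."""
    cache = []
    for x in tokens:
        cache.append(x)
        # Work per step is proportional to len(cache), so total work is quadratic.
        yield sum(cache) / len(cache), len(cache)
```

After n tokens the first generator still holds a single number, while the second holds n cached values and has done on the order of n² additions in total.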

Other long-standing applications include music generation, protein structure prediction before AlphaFold, handwriting synthesis, financial time-series modelling, and classical reinforcement learning policies with partial observability.

Limitations

Sequential computation. Each hidden state depends on the previous one, so training a conventional nonlinear RNN cannot be parallelised across the time dimension. This is the single most important reason transformers scale better.
Vanishing/exploding gradients. Even with LSTM or GRU cells, information retention over thousands of steps is imperfect.
Limited context compression. A fixed-size hidden state is a hard bottleneck when the relevant context is very large and diverse, which is why attention — and then pure-attention transformers — were initially introduced.

References

[1] Vaswani, A. et al. (2017). "Attention Is All You Need". Advances in Neural Information Processing Systems 30.
[2] Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities". Proceedings of the National Academy of Sciences 79 (8): 2554–2558.
[3] Jordan, M. I. (1986). "Serial Order: A Parallel Distributed Processing Approach". ICS Report 8604, University of California, San Diego.
[4] Elman, J. L. (1990). "Finding Structure in Time". Cognitive Science 14 (2): 179–211.
[5] Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen". Diploma thesis, Technical University of Munich.
[6] Hochreiter, S.; Schmidhuber, J. (1997). "Long Short-Term Memory". Neural Computation 9 (8): 1735–1780.
[7] Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". arXiv preprint.
[8] Sutskever, I.; Vinyals, O.; Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks". arXiv preprint.
[9] Sak, H.; Senior, A.; Beaufays, F. (2014). "Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition". arXiv preprint.
[10] Bahdanau, D.; Cho, K.; Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv preprint.
[11] Gu, A.; Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". arXiv preprint.
[12] Peng, B. et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era". arXiv preprint.
[13] Beck, M. et al. (2024). "xLSTM: Extended Long Short-Term Memory". arXiv preprint.
[14] Schuster, M.; Paliwal, K. K. (1997). "Bidirectional recurrent neural networks". IEEE Transactions on Signal Processing 45 (11): 2673–2681.
[15] Werbos, P. J. (1990). "Backpropagation through time: what it does and how to do it". Proceedings of the IEEE 78 (10): 1550–1560.
[16] Pascanu, R.; Mikolov, T.; Bengio, Y. (2013). "On the difficulty of training recurrent neural networks". Proceedings of the 30th International Conference on Machine Learning.
