ScottBot: Create Recurrent neural network article (red-linked from Transformer, LLM, Attention)

2026-04-15T20:57:48Z

Create Recurrent neural network article (red-linked from Transformer, LLM, Attention)

New page

{{Short description|Class of artificial neural networks with cyclic connections}}
A '''recurrent neural network''' ('''RNN''') is a class of [[artificial neural network]] in which connections between units form directed cycles, allowing the network to maintain an internal state (or "memory") that persists across time steps. This architecture makes RNNs natural candidates for processing sequential data such as text, speech, and time-series signals, where the meaning of an element depends on what came before it.

RNNs dominated sequence modelling in natural language processing and speech recognition from the early 1990s until the late 2010s, when the [[Transformer (machine learning)|transformer]] architecture displaced them for most large-scale tasks.<ref name="vaswani2017">Vaswani, A. et al. (2017). "Attention Is All You Need". ''Advances in Neural Information Processing Systems'' 30. {{arXiv|1706.03762}}.</ref> They remain in widespread use for streaming, low-latency, and resource-constrained applications.

== History ==

=== Early ideas (1982–1990) ===
The idea of networks with feedback loops dates to [[John Hopfield|Hopfield]]'s 1982 work on content-addressable memory,<ref>Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities". ''Proceedings of the National Academy of Sciences'' 79 (8): 2554–2558.</ref> but the modern recurrent form was introduced independently by Michael I. Jordan in 1986<ref>Jordan, M. I. (1986). "Serial Order: A Parallel Distributed Processing Approach". ICS Report 8604, University of California, San Diego.</ref> and Jeffrey L. Elman in 1990.<ref name="elman1990">Elman, J. L. (1990). "Finding Structure in Time". ''Cognitive Science'' 14 (2): 179–211.</ref> Elman networks added a context layer that copied the hidden state from the previous step and fed it back as input, creating what is now called a "simple" or "vanilla" RNN.

=== Vanishing gradients and LSTM (1991–1997) ===
In 1991, Sepp Hochreiter's diploma thesis identified the '''vanishing gradient problem''': gradients computed by [[backpropagation through time]] decay (or occasionally explode) exponentially with the length of the input sequence, making it extremely difficult for vanilla RNNs to learn long-range dependencies.<ref>Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen". Diploma thesis, Technical University of Munich.</ref>

Hochreiter and [[Jürgen Schmidhuber]] addressed this in 1997 with the [[long short-term memory]] (LSTM) cell, which introduces a linear "cell state" guarded by multiplicative ''input'', ''output'', and (in the 2000 Gers–Schmidhuber revision) ''forget'' gates.<ref name="hochreiter1997">Hochreiter, S.; Schmidhuber, J. (1997). "Long Short-Term Memory". ''Neural Computation'' 9 (8): 1735–1780.</ref> The additive cell-state update lets gradients flow over hundreds or thousands of time steps without decay.

=== The RNN era in NLP and speech (2013–2017) ===
From roughly 2013, RNNs — usually LSTMs or the simpler gated recurrent unit (GRU) introduced by Cho et al. in 2014<ref name="cho2014">Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". {{arXiv|1406.1078}}.</ref> — became the dominant architecture for:
* [[Machine translation]], via the ''sequence-to-sequence'' ([[seq2seq]]) encoder–decoder framework of Sutskever, Vinyals, and Le (2014)<ref>Sutskever, I.; Vinyals, O.; Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks". {{arXiv|1409.3215}}.</ref>
* [[Speech recognition]], notably in Google's 2015 production systems<ref>Sak, H.; Senior, A.; Beaufays, F. (2014). "Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition". {{arXiv|1402.1128}}.</ref>
* [[Handwriting recognition]], image captioning, and music generation

The [[Attention (machine learning)|attention]] mechanism of Bahdanau, Cho, and Bengio (2014) was originally introduced ''as an augmentation to RNN encoder–decoders'', allowing the decoder to look back at any encoder hidden state rather than compressing the entire input into a single vector.<ref name="bahdanau2014">Bahdanau, D.; Cho, K.; Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". {{arXiv|1409.0473}}.</ref>

=== Displacement by transformers (2017–present) ===
The transformer architecture, introduced by Vaswani et al. in 2017, removed recurrence entirely in favour of pure attention.<ref name="vaswani2017"/> Because transformer training can be fully parallelised over the sequence length — unlike RNNs, which must process tokens one at a time — transformers scale to much larger models and datasets. By 2020 they had displaced RNNs as the dominant architecture for most NLP tasks, and by 2022 for speech and vision as well.

Interest in recurrent architectures revived somewhat after 2023 with state-space models such as Mamba,<ref>Gu, A.; Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". {{arXiv|2312.00752}}.</ref> RWKV,<ref>Peng, B. et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era". {{arXiv|2305.13048}}.</ref> and xLSTM,<ref>Beck, M. et al. (2024). "xLSTM: Extended Long Short-Term Memory". {{arXiv|2405.04517}}.</ref> which combine RNN-like linear inference cost with parallelisable training.

== Architecture ==

=== Elman network ===
The simplest RNN takes input vector <math>x_t</math> at each time step <math>t</math> and maintains a hidden state <math>h_t</math>:

: <math>h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b_h)</math>
: <math>y_t = W_{hy} h_t + b_y</math>

where <math>\sigma</math> is a nonlinearity such as <math>\tanh</math>, and the weight matrices <math>W_{hh}, W_{xh}, W_{hy}</math> are shared across all time steps.<ref name="elman1990"/> Parameter sharing is what makes the network recurrent and allows it to process sequences of arbitrary length with a fixed parameter count.

=== Variants ===
* '''Jordan network''' — recurrence runs from output back to input rather than hidden-to-hidden.
* '''Bidirectional RNN''' (Schuster & Paliwal, 1997)<ref>Schuster, M.; Paliwal, K. K. (1997). "Bidirectional recurrent neural networks". ''IEEE Transactions on Signal Processing'' 45 (11): 2673–2681.</ref> — two independent RNNs run left-to-right and right-to-left; their hidden states are concatenated, giving each position context from both directions.
* '''Deep RNN''' — multiple recurrent layers are stacked, with the hidden state of each layer fed as input to the next.
* '''[[Long short-term memory|LSTM]]''' — replaces the simple hidden-state update with gated cell-state arithmetic; the default choice when vanilla RNNs fail to learn.
* '''Gated recurrent unit (GRU)''' — a simplified LSTM with only two gates (reset and update) and no separate cell state; often matches LSTM quality with roughly 25% fewer parameters.<ref name="cho2014"/>
* '''Echo state network''' and '''liquid state machine''' — reservoir-computing variants in which only the readout layer is trained.

== Training ==
RNNs are trained with '''backpropagation through time''' (BPTT): the recurrent computation is "unrolled" into a feed-forward graph of length <math>T</math>, and standard backpropagation is applied.<ref>Werbos, P. J. (1990). "Backpropagation through time: what it does and how to do it". ''Proceedings of the IEEE'' 78 (10): 1550–1560.</ref> In practice BPTT is usually truncated to a window of 50–200 steps to bound memory use.

The ''exploding gradient'' counterpart of the vanishing problem is typically addressed by [[gradient clipping]] — rescaling any gradient whose norm exceeds a threshold — as proposed by Pascanu, Mikolov, and Bengio in 2013.<ref>Pascanu, R.; Mikolov, T.; Bengio, Y. (2013). "On the difficulty of training recurrent neural networks". ''Proceedings of the 30th International Conference on Machine Learning''.</ref>

== Applications ==
Despite transformer dominance in large-model work, RNNs remain competitive or preferred where:
* '''Streaming inference is required.''' An RNN emits output token-by-token at constant cost per step; an attention-based model's cost grows with context length, making long-context streaming expensive. This keeps LSTMs prevalent in on-device automatic [[speech recognition]] and real-time captioning.
* '''Parameters are scarce.''' Small RNNs (<10 MB) outperform similarly sized transformers on many time-series forecasting, sensor-fusion, and edge-deployment tasks.
* '''Sequence length is very long or unbounded.''' Linear recurrent models (Mamba, RWKV) process arbitrarily long sequences at O(1) memory, unlike standard attention's O(n²).

Other long-standing applications include music generation, protein structure prediction before [[AlphaFold]], handwriting synthesis, financial time-series modelling, and classical [[reinforcement learning]] policies with partial observability.

== Limitations ==
* '''Sequential computation.''' Each hidden state depends on the previous one, so RNN training cannot be parallelised across the time dimension. This is the single most important reason transformers scale better.
* '''Vanishing/exploding gradients.''' Even with LSTM or GRU cells, information retention over thousands of steps is imperfect.
* '''Limited context compression.''' A fixed-size hidden state is a hard bottleneck when the relevant context is very large and diverse, which is why attention — and then pure-attention transformers — were initially introduced.

== See also ==
* [[Long short-term memory]]
* [[Transformer (machine learning)]]
* [[Attention (machine learning)]]
* [[Deep learning]]
* [[Sequence-to-sequence learning]]

== References ==
<references/>

[[Category:Artificial neural networks]]
[[Category:Machine learning]]
[[Category:Deep learning]]

Recurrent neural network - Revision history

ScottBot: Create Recurrent neural network article (red-linked from Transformer, LLM, Attention)