Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to learn long-range dependencies in sequential data. Introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, LSTM replaces the plain RNN hidden unit with a gated memory cell whose internal state can be preserved, updated, or cleared over many time steps, largely avoiding the vanishing gradients that cripple standard RNNs. For roughly two decades LSTM was the dominant sequence-modelling architecture, powering production systems for machine translation, speech recognition, handwriting recognition, time-series forecasting, and language modelling, until it was displaced by transformer-based models from 2017 onward.

History

The vanishing-gradient problem (1991)

In his 1991 diploma thesis at Technische Universität München, Sepp Hochreiter analysed the difficulty of training recurrent networks by backpropagation through time and showed that gradients propagated across many time steps tend to either vanish exponentially (losing all information) or explode (making training unstable). This result, independently highlighted by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994, established that ordinary RNNs are in practice unable to learn dependencies spanning more than about ten time steps.

The 1997 paper

Hochreiter and Schmidhuber published "Long Short-Term Memory" in Neural Computation in November 1997. The key innovation was the constant error carousel (CEC): a self-connected linear unit with weight 1.0 whose derivative remains exactly 1.0 during backpropagation, causing the error signal to flow backward through time without decay. To prevent uncontrolled accumulation, the CEC was surrounded by multiplicative gates — initially an input gate and an output gate — that learn when to let new information in and when to read from the cell.
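
The effect of a self-connection weight of exactly 1.0 can be checked with a little arithmetic. The sketch below assumes a scalar linear recurrent unit of the form c_t = w * c_{t-1} + input_t, for which the error signal after T backward steps is scaled by w^T. The function name `backprop_factor` is illustrative and does not come from the 1997 paper.

```python
def backprop_factor(w: float, steps: int) -> float:
    """Factor by which an error signal is scaled after `steps` backward
    passes through a scalar linear unit c_t = w * c_{t-1} + input_t.

    The derivative d c_t / d c_{t-1} equals w at every step, so the
    per-step factors simply multiply.
    """
    factor = 1.0
    for _ in range(steps):
        factor *= w
    return factor


# Compare a decaying weight, the CEC weight of 1.0, and a growing weight.
for w in (0.9, 1.0, 1.1):
    print(f"w = {w}: factor after 100 steps = {backprop_factor(w, 100):.3e}")
```

Only w = 1.0 leaves the signal unchanged; weights slightly below or above 1.0 shrink or inflate it by several orders of magnitude over 100 steps.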

The forget gate (1999)

The original 1997 cell had no mechanism to clear the CEC, which could become saturated on long streams. Felix Gers, Schmidhuber, and Fred Cummins added the forget gate in "Learning to Forget: Continual Prediction with LSTM" (presented in 1999 and published in Neural Computation in 2000). The three-gate cell (input / forget / output) is the version now universally referred to as "LSTM" and is present in every major deep-learning library.

Peephole connections and variants (2000–2015)

Gers and Schmidhuber (2000) added peephole connections allowing the gates to inspect the cell state directly. Alex Graves popularised bidirectional LSTM (BLSTM, with Schmidhuber, 2005) and connectionist temporal classification (CTC), enabling end-to-end speech and handwriting recognition. A 2015 empirical study by Klaus Greff and colleagues ("LSTM: A Search Space Odyssey") compared eight variants with the standard cell and found that the forget gate and the output activation function were the most critical components, while removing the peepholes or coupling the input and forget gates did not significantly hurt performance on the tasks tested.

Industrial adoption (2013–2017)

LSTM became mainstream when deep-learning frameworks made it accessible and GPU training made it tractable. Notable production deployments included:

  • Google Voice (2015) — LSTM-based acoustic models halved the word-error rate of the previous HMM–GMM system.
  • Google Translate (2016) — Google's Neural Machine Translation system (GNMT), built on stacked LSTM encoders and decoders with attention, replaced the previous phrase-based statistical system and reduced translation errors by an average of 60%.
  • Apple Siri, Amazon Alexa, Microsoft Cortana — all used LSTM-based components for speech recognition and language understanding during the mid-2010s.
  • Facebook reported in 2017 that LSTMs processed around 4.5 billion translations per day on its platform.

Displacement by transformers (2017–present)

The 2017 paper Attention Is All You Need introduced the transformer, which replaced recurrence with self-attention. Transformers parallelise training across the sequence dimension, which LSTMs cannot do because each time step depends on the previous hidden state, and this makes them far more efficient on modern GPU and TPU hardware. Within five years transformers had displaced LSTMs in essentially every large-scale NLP task. LSTMs remain in use for streaming applications, edge inference, very long sequences where the quadratic cost of attention is prohibitive, and certain time-series problems where inductive bias toward recency helps.

xLSTM (2024)

In May 2024 Sepp Hochreiter and colleagues published xLSTM, an "extended LSTM" that introduces exponential gating and two new cell variants (sLSTM with scalar memory and mLSTM with a matrix memory that can be trained in parallel). On language-modelling benchmarks the authors report that xLSTM scales competitively with comparably-sized transformers and state-space models such as Mamba, partially re-opening the question of whether recurrence can remain competitive at large scale.

Architecture

An LSTM layer processes an input sequence <math>x_1, x_2, \ldots, x_T</math> one step at a time. At each step <math>t</math> the layer maintains two state vectors: the cell state <math>c_t</math> (the long-term memory) and the hidden state <math>h_t</math> (the output exposed to the next layer and the next time step).

The standard three-gate cell performs the following computation at each time step:

<math>f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)</math>    (forget gate)
<math>i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)</math>    (input gate)
<math>\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)</math>    (candidate cell update)
<math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>    (new cell state)
<math>o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)</math>    (output gate)
<math>h_t = o_t \odot \tanh(c_t)</math>    (new hidden state)

Here <math>\sigma</math> is the logistic sigmoid (values in (0, 1)), <math>\odot</math> is element-wise multiplication, and <math>[h_{t-1}, x_t]</math> denotes concatenation of the previous hidden state and the current input.
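
The six equations above translate almost line for line into code. The following is a minimal sketch in plain Python rather than any library's implementation; the helper names (`sigmoid`, `affine`, `lstm_step`) and the `params` dictionary layout are assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically safe logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of the three-gate cell.

    `params` (hypothetical layout) maps each of "f", "i", "c", "o" to a
    (W, b) pair, where W has d_h rows and d_h + d_x columns because it
    acts on the concatenation [h_prev, x_t].
    """
    hx = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(*params["f"], hx)]          # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], hx)]          # input gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], hx)]  # candidate update
    o = [sigmoid(z) for z in affine(*params["o"], hx)]          # output gate
    # New cell state: keep part of the old memory, write part of the candidate.
    c_t = [f_k * c_k + i_k * g_k
           for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    # New hidden state: gated, squashed view of the cell state.
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


# Usage with all-zero weights: every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
d_h, d_x = 1, 2
zero = ([[0.0] * (d_h + d_x) for _ in range(d_h)], [0.0] * d_h)
params = {name: zero for name in "fico"}
h, c = lstm_step([1.0, -1.0], [0.0], [2.0], params)
print(h, c)
```

Production implementations usually fuse the four matrix products into one and run on batches, but the arithmetic per unit is the same as in this sketch.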

Intuition

  • The forget gate <math>f_t</math> produces a vector close to 1 where the previous cell contents should be kept and close to 0 where they should be erased. Because <math>c_t</math> combines <math>f_t \odot c_{t-1}</math> additively, information that the gate chooses to keep flows forward essentially unchanged — the constant-error-carousel property.
  • The input gate <math>i_t</math> decides how much of the candidate update <math>\tilde{c}_t</math> is written into the cell.
  • The output gate <math>o_t</math> decides how much of the cell state is exposed as the hidden state read by downstream layers.
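
The three roles can be illustrated with scalar values. This is a toy sketch with idealised gate values of exactly 0 or 1 (real sigmoid gates only approach these limits); the function names are illustrative.

```python
import math


def cell_update(c_prev: float, f: float, i: float, c_tilde: float) -> float:
    """Scalar form of c_t = f_t * c_{t-1} + i_t * c~_t."""
    return f * c_prev + i * c_tilde


def expose(c_t: float, o: float) -> float:
    """Scalar form of h_t = o_t * tanh(c_t)."""
    return o * math.tanh(c_t)


keep = cell_update(c_prev=0.8, f=1.0, i=0.0, c_tilde=-0.5)   # memory carried over unchanged
erase = cell_update(c_prev=0.8, f=0.0, i=0.0, c_tilde=-0.5)  # memory cleared
write = cell_update(c_prev=0.8, f=0.0, i=1.0, c_tilde=-0.5)  # memory overwritten by the candidate

hidden_closed = expose(keep, o=0.0)  # cell holds 0.8 but nothing is exposed
hidden_open = expose(keep, o=1.0)    # tanh(0.8) is exposed downstream

print(keep, erase, write, hidden_closed, hidden_open)
```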

Parameter count

For an input of dimension <math>d_x</math> and a hidden state of dimension <math>d_h</math>, each of the four transformations (the three gates and the candidate update) owns a weight matrix of shape <math>d_h \times (d_x + d_h)</math> plus a bias vector of length <math>d_h</math>, giving a total of <math>4 (d_h (d_x + d_h) + d_h)</math> parameters per layer, roughly four times a plain RNN of the same size.
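
As a quick check of the formula, the sketch below evaluates it for one illustrative layer size. The function names and the example dimensions are assumptions, not values from any particular model.

```python
def rnn_param_count(d_x: int, d_h: int) -> int:
    """Plain RNN layer: one d_h x (d_x + d_h) matrix plus one bias vector."""
    return d_h * (d_x + d_h) + d_h


def lstm_param_count(d_x: int, d_h: int) -> int:
    """LSTM layer: four such blocks (forget, input, candidate, output)."""
    return 4 * (d_h * (d_x + d_h) + d_h)


d_x, d_h = 128, 256
print(rnn_param_count(d_x, d_h))   # 98560
print(lstm_param_count(d_x, d_h))  # 394240
```

Counts reported by a specific library can differ slightly. PyTorch's `nn.LSTM`, for example, stores two bias vectors per block, which adds a further 4 * d_h parameters to the figure above.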

Training

LSTMs are trained by backpropagation through time, using any gradient-based optimiser (in practice Adam or RMSProp are the most common). Because the constant-error-carousel path carries gradients multiplicatively by the forget-gate values, vanishing gradients are mitigated but not eliminated: if <math>f_t</math> is consistently much less than 1, information still decays. Exploding gradients can still occur through the non-linear paths and are usually controlled by gradient clipping (scaling the gradient vector whenever its norm exceeds a threshold).
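
Both effects described above can be sketched in a few lines. The first function follows only the direct cell-state path, where the gradient from step T back to step 0 is the product of the intervening forget-gate values (this deliberately ignores the gates' own dependence on earlier states). The second is a generic clip-by-norm routine. Both names are illustrative.

```python
import math


def cell_path_gradient(forget_values):
    """Product of forget-gate activations along the direct cell-state path."""
    return math.prod(forget_values)


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its Euclidean norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


# A forget gate held near 1 preserves much of the signal over 100 steps;
# a gate held at 0.5 wipes it out almost completely.
print(cell_path_gradient([0.99] * 100))
print(cell_path_gradient([0.5] * 100))

# A gradient of norm 5 is rescaled to norm 1; direction is preserved.
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```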

Practical guidance that emerged during the LSTM era includes:

  • Forget-gate bias initialisation to 1 (Jozefowicz et al. 2015) — starting with the forget gate mostly "open" speeds up learning significantly.
  • Orthogonal initialisation of the recurrent weight matrices improves gradient flow at initialisation.
  • Dropout should be applied to the non-recurrent connections only (Zaremba et al. 2014) or to a shared mask across time steps (Gal & Ghahramani 2016, "variational dropout") — applying independent dropout to the recurrent path destroys the memory.
  • Layer normalisation (Ba et al. 2016) was developed in part to stabilise LSTM training and typically works better inside RNNs than batch normalisation.
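
Two of the points above, forget-gate bias initialisation and a dropout mask shared across time steps, are simple enough to sketch directly. The code below is an illustration under assumed names and data layouts (a dictionary of per-block bias vectors, hidden states as lists of floats), not the procedure from any of the cited papers' code.

```python
import random


def init_gate_biases(d_h: int, forget_bias: float = 1.0):
    """Bias vectors for the four blocks; only the forget gate starts open."""
    return {
        "f": [forget_bias] * d_h,  # forget gate biased towards keeping memory
        "i": [0.0] * d_h,
        "c": [0.0] * d_h,
        "o": [0.0] * d_h,
    }


def sample_dropout_mask(size: int, drop_prob: float, rng: random.Random):
    """Inverted-dropout mask: kept units are scaled by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = 1.0 - drop_prob
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]


def apply_shared_mask(hidden_states, mask):
    """Apply the SAME mask at every time step (one sample per sequence)."""
    return [[h * m for h, m in zip(h_t, mask)] for h_t in hidden_states]


rng = random.Random(0)
mask = sample_dropout_mask(size=4, drop_prob=0.5, rng=rng)
states = [[1.0, 1.0, 1.0, 1.0] for _ in range(3)]  # three time steps
print(apply_shared_mask(states, mask))
```

Because the mask is sampled once per sequence, a unit that is dropped stays dropped at every step, so the surviving units see a consistent recurrent path.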

Variants

Gated recurrent unit

The gated recurrent unit (GRU), introduced by Kyunghyun Cho and colleagues in 2014, merges the forget and input gates into a single update gate, adds a reset gate, and dispenses with the separate cell state, leaving only the hidden state. A GRU therefore has three weight blocks where an LSTM has four, about a quarter fewer parameters at the same hidden size, and it often matches LSTM performance on small-to-medium tasks while training faster.
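
For comparison with the LSTM cell, a GRU step can be sketched as follows. This is a plain-Python illustration; the helper names and the `params` layout are assumptions, and the final interpolation uses one of the two sign conventions found in the literature (here an update gate near 1 selects the candidate).

```python
import math


def sigmoid(z: float) -> float:
    """Numerically safe logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def gru_step(x_t, h_prev, params):
    """One GRU time step; `params` maps "z", "r", "h" to (W, b) pairs."""
    hx = h_prev + x_t                                    # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(*params["z"], hx)]   # update gate
    r = [sigmoid(a) for a in affine(*params["r"], hx)]   # reset gate
    # The candidate sees the previous state only through the reset gate.
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x_t
    h_tilde = [math.tanh(a) for a in affine(*params["h"], gated)]
    # Assumed convention: z near 1 takes the candidate, z near 0 keeps h_prev.
    return [(1.0 - z_k) * h_k + z_k * g_k
            for z_k, h_k, g_k in zip(z, h_prev, h_tilde)]


def gru_param_count(d_x: int, d_h: int) -> int:
    """Three blocks (update, reset, candidate), each with one bias vector."""
    return 3 * (d_h * (d_x + d_h) + d_h)


d_h, d_x = 1, 2
zero = ([[0.0] * (d_h + d_x) for _ in range(d_h)], [0.0] * d_h)
h = gru_step([1.0, -1.0], [2.0], {name: zero for name in "zrh"})
print(h, gru_param_count(128, 256))
```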

Bidirectional LSTM

A bidirectional LSTM (BLSTM) runs two independent LSTMs over the input, one left-to-right and one right-to-left, and concatenates their hidden states. This doubles the parameter count and latency but gives each output position access to both past and future context. BLSTMs were standard in speech recognition, named-entity recognition, and other offline sequence-labelling tasks prior to transformers.
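
The wiring of a bidirectional layer is independent of the cell used inside it. The sketch below shows that wiring with a generic `step(x, h) -> h` callable; a running sum stands in for the LSTM cell so that the result is easy to follow by hand. All names are illustrative.

```python
def run_recurrent(step, xs, h0):
    """Run `step(x, h) -> h` over `xs` and return the state after each input."""
    states, h = [], h0
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd, step_bwd, xs, h0):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    fwd = run_recurrent(step_fwd, xs, h0)
    # Run over the reversed input, then flip back so positions line up.
    bwd = run_recurrent(step_bwd, xs[::-1], h0)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # per-position list concatenation


def running_sum(x, h):
    """Stand-in for a recurrent cell: accumulates the inputs seen so far."""
    return [h[0] + x]


outputs = bidirectional(running_sum, running_sum, [1, 2, 3], [0])
print(outputs)  # [[1, 6], [3, 5], [6, 3]]
```

At each position the first component summarises the inputs up to that point and the second summarises the inputs from that point to the end, which is why the full sequence must be available before any output is produced.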

Peephole LSTM

A peephole LSTM adds direct connections from the cell state into the gate computations, letting the gates condition on <math>c_{t-1}</math> (or <math>c_t</math> for the output gate) in addition to <math>h_{t-1}</math> and <math>x_t</math>. This is useful for tasks requiring precise timing, such as counting or recognising periodic patterns, but the extra parameters rarely help on standard NLP benchmarks.
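
In the notation of the standard cell equations, one common way to write the peephole gates is shown below. The peephole weight vectors <math>p_f</math>, <math>p_i</math>, and <math>p_o</math> are symbols introduced here for illustration; they are applied element-wise, so each gate unit sees only its own cell.

<math>f_t = \sigma(W_f [h_{t-1}, x_t] + p_f \odot c_{t-1} + b_f)</math>    (forget gate with peephole)
<math>i_t = \sigma(W_i [h_{t-1}, x_t] + p_i \odot c_{t-1} + b_i)</math>    (input gate with peephole)
<math>o_t = \sigma(W_o [h_{t-1}, x_t] + p_o \odot c_t + b_o)</math>    (output gate with peephole)

The candidate update, the cell-state update, and the hidden-state equation are unchanged. The output gate can use <math>c_t</math> rather than <math>c_{t-1}</math> because the new cell state is already available when the output gate is evaluated.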

ConvLSTM

ConvLSTM replaces the dense gate matrix multiplications with 2-D convolutions, so that the cell state is a feature map rather than a vector. Introduced by Xingjian Shi and colleagues in 2015 for precipitation nowcasting, it is used where both spatial structure and temporal dynamics matter, such as video prediction and weather modelling.

Tree-LSTM

Tree-LSTM extends the LSTM cell to branching tree structures (Tai, Socher & Manning, 2015), summing contributions from multiple children rather than a single previous step. It was popular in semantic parsing and in constituency-tree sentiment analysis.

xLSTM

See History § xLSTM (2024) above. The sLSTM variant adds exponential gating and a normaliser state to a scalar-memory cell; the mLSTM variant replaces the scalar cell with a matrix and a covariance-style update rule that can be unrolled in parallel across the sequence, closing the single largest throughput gap between LSTMs and transformers.

Applications

Natural language processing

Before transformers, LSTMs were the standard architecture for:

  • Language modelling (Mikolov et al. 2010–2012 established neural language models; LSTM variants dominated 2014–2018).
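  • Neural machine translation: the encoder–decoder pattern of Sutskever et al. 2014 ("Sequence to Sequence Learning with Neural Networks") used LSTMs in both roles, and the attention mechanism of Bahdanau et al. 2014, originally demonstrated with GRU-style units, was later combined with LSTM encoders and decoders.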
  • Named-entity recognition, part-of-speech tagging, chunking — BLSTM-CRF (Huang, Xu & Yu, 2015) was the dominant approach before contextual embeddings.
  • Text classification, sentiment analysis, reading comprehension.
  • ELMo (Peters et al. 2018), the last major pre-transformer contextual embedding system, used bidirectional LSTMs.

Speech and audio

  • Acoustic modelling in large-vocabulary speech recognition (replacing HMM–GMM systems from about 2013 onward).
  • End-to-end speech recognition with CTC loss or attention-based sequence-to-sequence models.
  • Speaker identification, voice-activity detection, language identification.
  • Text-to-speech: Tacotron 2 (Shen et al. 2018) used a bidirectional LSTM in its encoder and LSTM layers in its decoder; the original Tacotron (Wang et al. 2017) relied on GRUs instead.

Vision and video

  • Image captioning: Show and Tell (Vinyals et al. 2015) fed a convolutional-network image embedding to an LSTM language model.
  • Action recognition in video — LSTMs consume per-frame CNN features (Donahue et al. 2015).
  • Optical character recognition, especially for cursive and multi-line scripts, using BLSTM with CTC.

Time-series and control

  • Financial forecasting and algorithmic trading.
  • Demand forecasting, electricity-load prediction, anomaly detection in sensor streams.
  • Robotics and reinforcement learning: LSTMs give agents memory in partially observable environments and were used in recurrent Atari agents, in DeepMind's AlphaStar (StarCraft II), and in OpenAI Five (Dota 2), whose policy was built around a single-layer LSTM with 4,096 units.

Bioinformatics

  • Protein secondary-structure prediction, protein-function annotation (precursor models to AlphaFold).
  • Genomic sequence modelling, variant calling.
  • Clinical event prediction from electronic health records.

Relation to other architectures

| Architecture | Path for long-range gradient | Parallelisable across sequence? | Memory of past |
|---|---|---|---|
| Plain RNN | Multiplicative through tanh → vanishes/explodes | No | Implicit in <math>h_t</math> |
| LSTM / GRU | Additive through gated cell → stable | No | Explicit in <math>c_t</math> |
| Transformer | Direct attention from any position to any other | Yes | Full sequence held in memory |
| State-space model (Mamba, S4) | Selective linear recurrence | Parallel scan | Compressed in state |

LSTMs are often described as the first practical solution to the long-range-dependency problem in sequence modelling. Transformers solve the same problem with a different trade-off: direct access to every position in the context window and parallel training, at the cost of memory and compute that scale quadratically with sequence length.
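
The difference in scaling can be made concrete with a rough count. The sketch below compares the number of pairwise attention scores in full self-attention (per head, per layer) with the number of sequential steps a recurrent layer takes. It is an order-of-magnitude illustration under those assumptions, not a cost model for any real system, and the function names are illustrative.

```python
def attention_score_count(seq_len: int) -> int:
    """Pairwise scores in full self-attention: one per (query, key) pair."""
    return seq_len * seq_len


def recurrent_step_count(seq_len: int) -> int:
    """A recurrent layer performs one cell update per position."""
    return seq_len


for n in (1024, 2048, 4096):
    print(n, attention_score_count(n), recurrent_step_count(n))
```

Doubling the sequence length doubles the recurrent step count but quadruples the number of attention scores; the recurrent steps, however, must run one after another, whereas the attention scores can be computed in parallel.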

Legacy

Although LSTM is no longer state-of-the-art for most large-scale NLP tasks, the concepts it introduced — gated memory, constant-error carousels, forget mechanisms — are now standard components of many architectures including gated convolutional networks, highway networks, ResNets (whose identity skip-connections can be viewed as a degenerate form of the forget-gate-open state), and modern state-space models. Sepp Hochreiter has continued to argue publicly that recurrence is not obsolete and that the main engineering disadvantage of LSTMs — their sequential training — has now been addressed by the xLSTM matrix variant.

As of 2024 the 1997 LSTM paper has been cited more than 100,000 times according to Google Scholar, placing it among the most-cited computer-science papers of the past three decades.

References

  • Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory". Neural Computation 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735.
  • Gers, F. A., Schmidhuber, J. & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM". Neural Computation 12 (10): 2451–2471.
  • Graves, A. & Schmidhuber, J. (2005). "Framewise phoneme classification with bidirectional LSTM and other neural network architectures". Neural Networks 18 (5–6): 602–610.
  • Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). "LSTM: A Search Space Odyssey". IEEE Transactions on Neural Networks and Learning Systems 28 (10): 2222–2232.
  • Jozefowicz, R., Zaremba, W. & Sutskever, I. (2015). "An Empirical Exploration of Recurrent Network Architectures". ICML 2015.
  • Sutskever, I., Vinyals, O. & Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks". NeurIPS 2014.
  • Wu, Y. et al. (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv:1609.08144.
  • Beck, M., Pöppel, K., Spanring, M., Auer, A., Prudnikova, O., Kopp, M., Klambauer, G., Brandstetter, J. & Hochreiter, S. (2024). "xLSTM: Extended Long Short-Term Memory". arXiv:2405.04517.
  • Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen" (Diploma thesis). Technische Universität München.