BERT

BERT (Bidirectional Encoder Representations from Transformers) is a family of transformer-based language models introduced by researchers at Google in October 2018.^[1] It was the first large-scale language model to use masked-token pretraining on a bidirectional transformer encoder, and it held the state of the art on most natural-language-understanding benchmarks (including GLUE, SQuAD, and SWAG) from late 2018 until mid-2019. BERT established the now-standard "pretrain on unlabelled text, then fine-tune on a downstream task" paradigm that dominated NLP for roughly three years, and its architecture remains the basis of the encoder family (RoBERTa, ALBERT, DeBERTa, ELECTRA, DistilBERT, ModernBERT) still widely used in search, classification, and retrieval pipelines.

Although decoder-only generative models eventually displaced BERT for user-facing applications, BERT-style encoders continue to be deployed in Google Search, e-commerce ranking, and retrieval-augmented generation (RAG) systems, where fast bidirectional embedding and classification are more important than open-ended text generation.

Background

Before BERT, the two dominant approaches to pre-trained language representations were:

Feature-based approaches such as ELMo (Peters et al., 2018), which produced contextual word vectors from a bidirectional LSTM and fed them as fixed features into task-specific architectures.
Fine-tuning approaches such as OpenAI's GPT (Radford et al., June 2018), which used a left-to-right transformer decoder pre-trained on a language-modelling objective and then fine-tuned end-to-end on each downstream task.

GPT's left-to-right constraint meant that, at every position, the representation of a token depended only on the tokens to its left. The BERT authors argued this was "sub-optimal for sentence-level tasks and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions."^[2]

BERT's key innovation was to pre-train a deep bidirectional transformer by masking a fraction of input tokens and training the model to predict them from both left and right context simultaneously.
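The difference between the two regimes can be sketched as attention-visibility masks. The snippet below is an illustrative toy rather than code from either model; the function names and the example sentence are invented here.

```python
def causal_mask(n):
    """Left-to-right (GPT-style): position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder (BERT-style): every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "bank", "of", "the", "river"]
causal = causal_mask(len(tokens))
full = bidirectional_mask(len(tokens))

# Which tokens can inform the representation of "bank" (index 1)?
visible_causal = [t for t, m in zip(tokens, causal[1]) if m]
visible_full = [t for t, m in zip(tokens, full[1]) if m]
print(visible_causal)  # ['the', 'bank']
print(visible_full)    # ['the', 'bank', 'of', 'the', 'river']
```

Under the causal mask, "bank" never sees "river", which is the kind of right-hand context the BERT authors argued was needed for token-level tasks.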

Architecture

BERT is a stack of transformer encoder layers — the same encoder half described in "Attention Is All You Need" (Vaswani et al., 2017), with no decoder. Two sizes were released in the original paper:

BERT-Base: 12 layers, hidden size 768, 12 attention heads, ~110 million parameters.
BERT-Large: 24 layers, hidden size 1024, 16 attention heads, ~340 million parameters.
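The quoted parameter counts can be roughly reproduced by tallying the weight matrices of a standard transformer encoder. The sketch below makes several assumptions that are not stated in the lists above: a feed-forward width of four times the hidden size, the 30,522-entry vocabulary and 512 positions of the released English uncased model, bias terms on every dense layer, and a dense pooling layer over [CLS]. The function name is hypothetical.

```python
def encoder_params(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT-style encoder (assumptions in lead-in)."""
    ffn = 4 * hidden                      # assumed feed-forward width
    layer_norm = 2 * hidden               # scale and bias vectors
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden     # dense layer applied to [CLS]
    return embeddings + layers * per_layer + pooler


base = encoder_params(layers=12, hidden=768)
large = encoder_params(layers=24, hidden=1024)
print(f"BERT-Base  ~{base / 1e6:.1f}M parameters")
print(f"BERT-Large ~{large / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the tally because splitting the hidden dimension into heads leaves the projection matrices the same size. Under these assumptions the totals land close to, though not exactly on, the rounded figures quoted above.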

The input is a sequence of WordPiece tokens (vocabulary size 30,522 for the English uncased model; the cased model uses a smaller vocabulary of 28,996), prefixed with a special [CLS] token whose final hidden state is used as a pooled sentence representation for classification. Sentence pairs are separated by a [SEP] token, and a learned segment embedding (A or B) is added to indicate which sentence each token belongs to. Positional information is supplied by learned, not sinusoidal, position embeddings; because the released models learn only 512 of them, inputs are limited to 512 tokens.
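The input layout described above can be sketched as follows. This is a minimal illustration that assumes the text has already been split into WordPiece tokens; the function name is hypothetical, and the trailing [SEP] after the second sentence follows the released implementation rather than anything stated above.

```python
def build_inputs(tokens_a, tokens_b=None, max_len=512):
    """Assemble tokens, segment ids and position ids for one BERT input."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)            # segment A
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]   # assumed trailing [SEP]
        segment_ids += [1] * (len(tokens_b) + 1)   # segment B
    if len(tokens) > max_len:
        raise ValueError(f"{len(tokens)} tokens exceeds the {max_len}-position limit")
    position_ids = list(range(len(tokens)))    # index into the learned position table
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
print(tokens)
print(segment_ids)   # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The model's input at each position is then the sum of the token, segment, and position embeddings looked up with these three lists.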

Pretraining

BERT is pre-trained simultaneously on two self-supervised objectives:

Masked language modelling (MLM)

Fifteen per cent of input tokens are selected at random. Of those:

80 % are replaced with the special [MASK] token,
10 % are replaced with a random vocabulary token,
10 % are left unchanged.

The model is trained to predict the original token at every selected position using the cross-entropy loss; unselected positions contribute nothing to the loss. The 80/10/10 mixture exists because the [MASK] token never appears at fine-tuning time; always replacing selected tokens with [MASK] would create a mismatch between pre-training and fine-tuning inputs.
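The corruption procedure can be sketched in a few lines. This is a simplified illustration, not the original implementation: it selects each token independently with probability 0.15 rather than picking a fixed number of positions per sequence, it skips [CLS] and [SEP], and the toy vocabulary and function name are invented here.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, targets); targets maps position -> original token."""
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # not selected: no loss is computed at this position
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80 %: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10 %: replace with a random token
        # remaining 10 %: leave the token unchanged, but still predict it
    return corrupted, targets


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]"] + [rng.choice(vocab) for _ in range(10_000)] + ["[SEP]"]
corrupted, targets = mask_tokens(tokens, vocab, rng)
masked = sum(corrupted[i] == "[MASK]" for i in targets)
print(len(targets) / 10_000)  # close to 0.15
print(masked / len(targets))  # close to 0.80
```

The training loss would be the cross-entropy between the model's prediction at each position in `targets` and the original token stored there.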

Next sentence prediction (NSP)

Each training example is a pair of sentences (A, B). Fifty per cent of the time B is the sentence that actually follows A in the corpus; the other 50 % it is a randomly sampled sentence. The final hidden state of the [CLS] token is passed through a two-class classifier trained to distinguish the two cases. NSP was intended to teach the model sentence-level relationships useful for question answering and natural language inference.
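A minimal sketch of how such pairs might be generated is shown below. The labels IsNext and NotNext follow the paper's naming; drawing the random sentence from a different document is an assumption of this sketch, and the function name and toy corpus are hypothetical.

```python
import random


def make_nsp_pairs(documents, rng):
    """documents: list of documents, each a list of sentence strings (at least two documents)."""
    pairs = []
    for d, doc in enumerate(documents):
        others = [k for k in range(len(documents)) if k != d]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A.
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: B is sampled from another document.
                distractor = rng.choice(documents[rng.choice(others)])
                pairs.append((doc[i], distractor, "NotNext"))
    return pairs


documents = [
    ["a1", "a2", "a3", "a4"],
    ["b1", "b2", "b3"],
    ["c1", "c2", "c3", "c4", "c5"],
]
pairs = make_nsp_pairs(documents, random.Random(0))
for sentence_a, sentence_b, label in pairs:
    print(sentence_a, sentence_b, label)
```

Each pair would then be packed as [CLS] A [SEP] B [SEP], with the label predicted from the final hidden state of [CLS].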

Later work, most notably RoBERTa (Liu et al., 2019), showed that NSP contributes little or nothing and that removing it while training longer on more data improves downstream performance. Among subsequent encoder models, ALBERT replaced NSP with sentence-order prediction, while DeBERTa, ELECTRA, and ModernBERT dropped it entirely.

Training corpus

Pretraining used the concatenation of the BooksCorpus (~800 million words) and the text portion of English Wikipedia (~2.5 billion words), totalling roughly 3.3 billion words. BERT-Base was trained for 1 million steps on 16 TPU chips and BERT-Large for 1 million steps on 64 TPU chips; each run took approximately four days.

Fine-tuning

For downstream tasks BERT is fine-tuned end-to-end: the entire pre-trained network is used as the initialisation and all parameters are updated on task-specific labelled data. Typical fine-tuning uses batch size 16 or 32, learning rate between 2 × 10⁻⁵ and 5 × 10⁻⁵, and two to four epochs.
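Because the recommended ranges are small, an exhaustive sweep is cheap to enumerate. The sketch below assumes 3 × 10⁻⁵ as an intermediate learning rate and a hypothetical training set of 8,000 examples; the function and key names are invented for illustration.

```python
import math
from itertools import product

BATCH_SIZES = (16, 32)
LEARNING_RATES = (2e-5, 3e-5, 5e-5)   # 3e-5 is an assumed midpoint
EPOCHS = (2, 3, 4)


def fine_tuning_grid(num_examples):
    """Yield one configuration per combination, with its number of update steps."""
    for batch_size, learning_rate, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCHS):
        yield {
            "batch_size": batch_size,
            "learning_rate": learning_rate,
            "epochs": epochs,
            "update_steps": math.ceil(num_examples / batch_size) * epochs,
        }


grid = list(fine_tuning_grid(num_examples=8000))
print(len(grid))                              # 18
print(min(c["update_steps"] for c in grid))   # 500
print(max(c["update_steps"] for c in grid))   # 2000
```

Even the longest configuration here is only a few thousand updates, which illustrates why fine-tuning is far cheaper than the million-step pre-training run.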

Four task patterns are supported by the original paper:

Single-sentence classification (e.g. sentiment analysis): add a linear classifier on top of the [CLS] token's final hidden state.
Sentence-pair classification (e.g. natural language inference, semantic textual similarity): feed both sentences separated by [SEP], classify on [CLS].
Extractive question answering (e.g. SQuAD): predict start and end token positions in the passage with two linear layers over every token's final hidden state.
Sequence tagging (e.g. named-entity recognition): predict a label for every token from its final hidden state.
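The four patterns differ only in the small head placed on top of the encoder's final hidden states. The sketch below uses random vectors as a stand-in for real encoder output and plain Python lists instead of a tensor library; all function names, dimensions, and label counts are hypothetical. For span extraction it scores a span as the start score plus the end score, subject to the end not preceding the start.

```python
import random


def linear(x, weight, bias):
    """Apply a dense layer: weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def new_head(out_dim, hidden, rng):
    weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)


def classify(hidden_states, head):
    """Single-sentence or sentence-pair classification: read [CLS] at position 0."""
    return argmax(linear(hidden_states[0], *head))


def tag(hidden_states, head):
    """Sequence tagging: one label per token."""
    return [argmax(linear(h, *head)) for h in hidden_states]


def extract_span(hidden_states, start_head, end_head):
    """Extractive QA: best (start, end) pair with end >= start."""
    start = [linear(h, *start_head)[0] for h in hidden_states]
    end = [linear(h, *end_head)[0] for h in hidden_states]
    n = len(hidden_states)
    candidates = ((s, e) for s in range(n) for e in range(s, n))
    return max(candidates, key=lambda se: start[se[0]] + end[se[1]])


rng = random.Random(0)
hidden, seq_len = 8, 6
# Stand-in for the encoder's final hidden states, one vector per token.
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(seq_len)]

label = classify(hidden_states, new_head(3, hidden, rng))
tags = tag(hidden_states, new_head(5, hidden, rng))
span = extract_span(hidden_states, new_head(1, hidden, rng), new_head(1, hidden, rng))
print(label, tags, span)
```

In actual fine-tuning these heads are trained jointly with the encoder, so the gradient from the task loss updates every pre-trained parameter as well.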

Reception and impact

BERT swept the major benchmarks on release. It improved the state of the art on eleven tasks, including raising the GLUE score from 72.8 to 80.5 and SQuAD v1.1 test F1 from 91.7 to 93.2.^[2] Within two years a large share of NLP papers either built on BERT, compared against it, or replaced an LSTM baseline with it.

Three weeks after the paper appeared, the authors released pre-trained weights for English, Chinese, and a 104-language multilingual variant (mBERT), all under the Apache 2.0 licence. The availability of ready-to-fine-tune weights — together with the Hugging Face transformers library (initially released November 2018 as pytorch-pretrained-BERT) — caused BERT-style fine-tuning to become the default NLP workflow almost overnight.

In October 2019 Google announced that BERT had been deployed in Google Search to improve understanding of English-language queries, affecting roughly one in ten searches, with rollout to additional languages in December 2019. Google described the change as its biggest leap forward in search in five years.

Variants and successors

Several direct descendants of BERT are now more widely used than the original model:

RoBERTa (Liu et al., 2019, Facebook AI): same architecture, no NSP, larger batches, longer training, byte-level BPE. Consistently outperforms BERT on GLUE.
ALBERT (Lan et al., 2019): parameter sharing across layers and factorised embeddings; an ALBERT configuration comparable to BERT-Large has roughly 18× fewer parameters at comparable accuracy.
DistilBERT (Sanh et al., 2019): a 40 %-smaller student trained by knowledge distillation, retaining 97 % of BERT-Base's GLUE score.
ELECTRA (Clark et al., 2020): replaces MLM with replaced-token detection, a discriminative objective that is markedly more compute-efficient.
DeBERTa (He et al., 2020): disentangled attention with separate content and position vectors; long held the top of the SuperGLUE leaderboard.
ModernBERT (Warner et al., 2024): modernises the BERT encoder family for retrieval and classification workloads with rotary position embeddings, a longer 8,192-token context, FlashAttention, and a 2-trillion-token training mix.

Decline and continuing use

From 2020 onward attention shifted toward decoder-only generative models. GPT-3 (paper published May 2020) demonstrated that a sufficiently scaled left-to-right transformer could approach, and on some tasks match or exceed, fine-tuned BERT models through in-context learning alone, without any task-specific fine-tuning. Because decoders can both classify and generate, they proved more useful for conversational applications, and by 2023 the public face of "large language models" was almost exclusively decoder-only.

BERT-style encoders nevertheless remain the default for:

Dense retrieval and embeddings: Sentence-BERT and its successors produce fixed-length vectors used in semantic search, deduplication, and the retriever stage of retrieval-augmented generation systems.
Text classification at scale: fine-tuned BERT-Base inference is at least an order of magnitude cheaper than inference with a frontier generative LLM, which matters for production moderation, routing, and ranking workloads.
Token-level tagging: named-entity recognition, part-of-speech tagging, and span extraction are more naturally formulated as per-token classification over bidirectional context than as autoregressive generation.
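The dense-retrieval use case can be illustrated with a toy example. The sketch assumes mean pooling over non-padding tokens, a common way to turn per-token encoder outputs into one fixed-length vector, followed by cosine-similarity ranking; the two-dimensional vectors are made up and the function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    return [sum(column) / len(kept) for column in zip(*kept)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    return sorted(range(len(documents)), key=lambda i: cosine(query, documents[i]), reverse=True)


# Toy per-token encoder outputs; the last position is padding.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
query = mean_pool(token_vectors, attention_mask)

documents = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]  # precomputed document embeddings
print(query)                   # [2.0, 0.0]
print(rank(query, documents))  # [1, 2, 0]
```

Because document embeddings can be computed once and stored, only the query needs to pass through the encoder at search time, which is a large part of why encoders remain attractive for retrieval.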

References

[1] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805.
[2] Devlin et al. (2018), §1.
