Natural language processing

Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics concerned with enabling computers to understand, interpret, generate, and reason about human language. It is the scientific foundation underlying large language models such as GPT-4, Claude, and BERT, and is applied in machine translation, search engines, voice assistants, sentiment analysis, document summarisation, and question answering.

NLP sits at the intersection of computer science, linguistics, and statistics. Its central challenge is ambiguity: natural language is riddled with lexical polysemy ("bank" as a financial institution or a river edge), syntactic ambiguity (in "I saw the man with the telescope", either the speaker or the man may have the telescope), pragmatic context-dependence, and figurative language. Unlike programming languages, human languages have no formal specification, and meaning depends on context, world knowledge, and speaker intent.

History

Rule-based era (1950s–1980s)

The field's origins are conventionally traced to Alan Turing's 1950 paper "Computing Machinery and Intelligence," which proposed the imitation game (now called the Turing test) as a criterion for machine intelligence — fundamentally a test of language ability.^[1]

Early NLP systems relied on hand-crafted rules and symbolic logic:

ELIZA (1966): Joseph Weizenbaum's MIT program simulated a Rogerian therapist using simple pattern matching and substitution rules. Despite its trivial mechanism, users frequently attributed genuine understanding to it — the "ELIZA effect."^[2]
SHRDLU (1970): Terry Winograd's system could understand and generate English sentences about a simulated blocks world, using a combination of syntactic parsing, semantic interpretation, and procedural reasoning. Its success was impressive but narrowly limited to the toy domain.
Conceptual Dependency (1970s): Roger Schank's theory represented sentence meaning as language-independent conceptual structures, anticipating later work on semantic representations.
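
The pattern matching and substitution behind ELIZA can be illustrated with a few lines of Python. This is a minimal sketch only: the rules, the reflection table, and the names `RULES`, `reflect`, and `respond` are hypothetical and are not taken from Weizenbaum's original script.

```python
import re

# Hypothetical word swaps: first person becomes second person.
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}

# Hypothetical (pattern, response template) pairs, tried in order.
RULES = [
    (re.compile(r"i feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]


def reflect(fragment: str) -> str:
    """Swap pronouns so the user's words can be echoed back."""
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())


def respond(utterance: str) -> str:
    """Return the template of the first matching rule, or a stock reply."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."


print(respond("I feel sad about my job."))
```

Nothing in the sketch models meaning: the reply is assembled entirely from the user's own words, which is why the mechanism is described above as trivial.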

The rule-based approach produced systems that worked in narrow domains but failed to scale. Chomsky's transformational grammar influenced the field theoretically, but the combinatorial explosion of linguistic rules made comprehensive hand-coding impractical.

Statistical revolution (1990s–2010s)

The shift from rules to data began in the late 1980s and accelerated through the 1990s, driven by three factors: the availability of large digital text corpora (such as the Penn Treebank and, from the 2000s, Europarl), increased computing power, and the success of probabilistic methods in speech recognition.

Key developments:

Hidden Markov Models (HMMs): Applied to part-of-speech tagging and named entity recognition. HMM-based part-of-speech taggers reached accuracies above 95% on standard English benchmarks, exceeding earlier hand-written rule-based taggers.
Statistical machine translation (SMT): The IBM Models (Brown et al., 1993) and later phrase-based SMT (Koehn et al., 2003) treated translation as a noisy-channel problem, learning alignments and phrase tables from parallel corpora. Google Translate launched in 2006 using phrase-based SMT.
Conditional random fields (CRFs): Lafferty et al. (2001) introduced discriminative sequence models that outperformed HMMs on many structured prediction tasks.
Latent Dirichlet Allocation (LDA): Blei et al. (2003) introduced probabilistic topic models, enabling unsupervised discovery of thematic structure in document collections.
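
The noisy-channel view of translation mentioned above can be written as a short derivation. The symbols are notational assumptions rather than anything fixed by this article: f denotes the observed source sentence and e a candidate target sentence.

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        = \arg\max_{e} \underbrace{P(e)}_{\text{language model}} \;
                       \underbrace{P(f \mid e)}_{\text{translation model}}
```

The denominator P(f) is dropped because it does not depend on e. The decomposition lets the language model be estimated from monolingual text, while only the translation model requires parallel corpora.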

Frederick Jelinek's famous quip — "Every time I fire a linguist, the performance of the speech recognizer goes up" — captured the era's spirit, though it somewhat overstated the case: linguistic features remained useful as inputs to statistical models.

Neural NLP (2013–present)

The application of deep learning to NLP, beginning around 2013, transformed the field:

Word embeddings (2013): Tomáš Mikolov's Word2Vec and later GloVe (Pennington et al., 2014) showed that unsupervised training on large corpora could produce dense vector representations capturing semantic relationships (the famous "king − man + woman ≈ queen" analogy). These replaced sparse, hand-crafted feature vectors as the standard input representation.
Sequence-to-sequence models (2014): Sutskever et al. demonstrated that recurrent neural networks (specifically LSTMs) could translate between languages by encoding a source sentence into a fixed-length vector and decoding it into a target sentence.
Attention (2014–2015): Bahdanau et al. introduced the attention mechanism, allowing decoders to focus on different parts of the input at each generation step, dramatically improving translation quality on long sentences.
Transformer (2017): Vaswani et al.'s "Attention Is All You Need" replaced recurrence entirely with self-attention, enabling massive parallelisation and scaling. This is the architecture behind nearly all modern LLMs.^[3]
Pre-training and transfer learning (2018): BERT (Devlin et al.) and GPT (Radford et al.) demonstrated that pre-training a large transformer on unlabelled text and then fine-tuning on downstream tasks could achieve state-of-the-art results across a wide range of NLP benchmarks. This paradigm shift made it possible to build high-quality NLP systems with relatively small labelled datasets.
Large language models (2020–present): Scaling up pre-trained transformers to hundreds of billions of parameters, beginning with the 175-billion-parameter GPT-3 and continuing with successors such as GPT-4, Claude, and Gemini (whose sizes have not been publicly disclosed), produced systems capable of few-shot and zero-shot generalisation across tasks, fundamentally changing the economics and practice of NLP.
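
The attention computation at the core of the Transformer can be sketched in plain Python. The sketch below implements scaled dot-product attention, softmax(QKᵀ/√d_k)V, for small lists of vectors. It deliberately omits the learned projection matrices, multiple heads, and masking used in real models, and the function names `softmax` and `attention` are illustrative choices.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# In self-attention, queries, keys, and values all come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(tokens, tokens, tokens))
```

Each output row depends on all input rows at once and the rows can be computed independently of one another, which is the property that permits the parallelisation noted above.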

Core tasks

NLP encompasses a wide range of tasks at different levels of linguistic analysis:

Text preprocessing

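Tokenisation: Splitting text into meaningful units (words, subwords, or characters). Modern systems use subword tokenisation algorithms such as Byte Pair Encoding (BPE), WordPiece, or the unigram language model (the last often via the SentencePiece toolkit, which also implements BPE). These handle rare words and morphologically rich languages by splitting unfamiliar words into smaller known pieces.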
Sentence segmentation: Identifying sentence boundaries — non-trivial when periods appear in abbreviations, decimals, and URLs.
Normalisation: Lowercasing, stemming, lemmatisation, and Unicode normalisation.
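
As an illustration of subword tokenisation, the following is a minimal sketch of how BPE merges might be learned from word frequencies. It assumes a toy corpus, marks word ends with a `</w>` symbol, and breaks ties by first occurrence. Production tokenisers differ in many details, and the function names here are illustrative.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    # Start from characters, with an end-of-word marker on each word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy word frequencies, chosen only for illustration.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(corpus, 3)
print(merges)
```

On this toy corpus the first three merges assemble the frequent suffix "est" plus the end-of-word marker into a single symbol, so an unseen word ending in "est" could still be segmented into known pieces.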

Syntactic analysis

Part-of-speech (POS) tagging: Assigning grammatical categories (noun, verb, adjective) to each token. Modern taggers achieve per-token accuracy above 97% on standard English benchmarks.
Constituency parsing: Producing a phrase-structure tree (e.g., [S [NP The cat] [VP sat [PP on [NP the mat]]]]).
Dependency parsing: Identifying head–dependent relationships between words (e.g., "cat" ← nsubj ← "sat").
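
The bracketed notation used for the constituency example above maps directly onto a nested data structure. The sketch below reads such a string into `(label, children)` tuples and recovers the sentence from the leaves. The function names and the tuple representation are illustrative assumptions, and the reader only handles well-formed input.

```python
def parse_brackets(text):
    """Read a bracketed tree such as "[S [NP The cat] ...]" into nested tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "[", "a constituent must start with '['"
        pos += 1
        label = tokens[pos]  # the phrase label, e.g. S, NP, VP
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing "]"
        return (label, children)

    return parse_node()


def leaves(tree):
    """Collect the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_brackets("[S [NP The cat] [VP sat [PP on [NP the mat]]]]")
print(tree)
print(leaves(tree))
```

A dependency parse of the same sentence would instead be a flat set of (head, relation, dependent) triples such as ("sat", "nsubj", "cat"), with no phrase-level nodes.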

Semantic and pragmatic tasks

Named entity recognition (NER): Identifying and classifying mentions of people, organisations, locations, dates, and other entities in text.
Semantic role labelling (SRL): Identifying "who did what to whom" — the predicate-argument structure of sentences.
Word sense disambiguation (WSD): Determining which meaning of a polysemous word is intended in context.
Coreference resolution: Determining which noun phrases in a text refer to the same real-world entity (e.g., "Marie Curie ... she ... the physicist").

Generation and understanding tasks

Machine translation (MT): Translating text between languages. Neural MT (Bahdanau et al., 2014; Vaswani et al., 2017) dramatically improved quality over statistical MT, with Google switching to neural MT in 2016.
Text summarisation: Producing a shorter version of a document that preserves key information. Extractive summarisation selects existing sentences; abstractive summarisation generates new text.
Question answering (QA): Given a question and optionally a context passage, producing a correct answer. SQuAD (Rajpurkar et al., 2016) was an influential benchmark.
Sentiment analysis: Classifying the opinion or emotion expressed in text (positive, negative, neutral, or fine-grained).
Natural language inference (NLI): Determining whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise.
Text generation: Producing fluent, coherent text — from autocomplete to creative writing to code generation.

Evaluation

NLP tasks are evaluated using a combination of automatic metrics and human judgement:

BLEU (Papineni et al., 2002): Precision-oriented n-gram overlap metric for machine translation, combined with a brevity penalty for overly short output. Widely used despite known limitations (insensitivity to meaning-preserving paraphrases).
ROUGE (Lin, 2004): Recall-oriented metric for summarisation.
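F1 score: The harmonic mean of precision and recall; the standard metric for NER, QA, and classification tasks.
Perplexity: Intrinsic metric for language models, measuring how well the model predicts a held-out test set. It is the exponential of the average negative log-likelihood per token, so lower is better.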
Human evaluation: For generation tasks, human ratings of fluency, coherence, factual accuracy, and helpfulness remain the gold standard, though they are expensive and variable.
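
Three of the automatic metrics above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a reference implementation: `modified_precision` is only the clipped n-gram precision that BLEU is built from (full BLEU combines several n-gram orders and a brevity penalty), `f1_score` compares two sets of predicted and gold items, and `perplexity` takes the probabilities a model assigned to each held-out token. All function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited
    at most as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over two sets of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


candidate = "the the the the".split()
reference = "the cat is on the mat".split()
print(modified_precision(candidate, reference, 1))
print(f1_score({"Paris", "Curie"}, {"Curie", "Warsaw"}))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The clipping step is what stops a degenerate candidate that repeats one common word from scoring perfect unigram precision. A model that assigns probability 0.25 to every token has a perplexity of about 4, matching the intuition of choosing uniformly among four options.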

Benchmark suites like GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), BIG-bench (Srivastava et al., 2022), and MMLU (Hendrycks et al., 2021) aggregate multiple tasks into a single leaderboard, though benchmark saturation — models achieving near-human or above-human scores — has led to a search for harder evaluations.

Challenges

Despite dramatic progress, several fundamental challenges remain:

Hallucination: Large language models can generate fluent text that is factually incorrect or unsupported by their sources, a problem that persists even in the most capable models. Retrieval-augmented generation (RAG) and improved training methods mitigate but do not eliminate hallucination.
Multilingual equity: Most NLP research and datasets are English-centric. Performance on low-resource languages (most of the world's ~7,000 languages) remains substantially worse.
Bias and fairness: Language models absorb and amplify biases present in training data, including gender, racial, and cultural stereotypes.
Reasoning: While LLMs show impressive pattern completion, their capacity for genuine logical and mathematical reasoning remains debated. Chain-of-thought prompting and tool use improve performance but do not resolve the underlying question.
Efficiency: State-of-the-art NLP models require enormous computational resources for training and inference, raising environmental and access concerns.

Applications

NLP technology is deployed across virtually every industry:

Search engines: Google's integration of BERT (2019) and later LLM-based search dramatically improved query understanding.
Virtual assistants: Siri, Alexa, Google Assistant, and Copilot rely on NLP for speech recognition, intent classification, and response generation.
Healthcare: Clinical NLP extracts diagnoses, medications, and procedures from unstructured medical records. LLMs assist with medical question answering and literature review.
Legal: Contract analysis, case law search, and regulatory compliance monitoring.
Finance: Sentiment analysis of news and social media for trading signals; automated report generation.
Education: Automated essay scoring, language learning applications, and tutoring systems.
Software engineering: Code generation, completion, and review (GitHub Copilot, Claude Code).

References

[1] Turing, A. M. (1950). "Computing Machinery and Intelligence." Mind, 59(236), 433–460.
[2] Weizenbaum, J. (1966). "ELIZA: a computer program for the study of natural language communication between man and machine." Communications of the ACM, 9(1), 36–45.
[3] Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30.