ScottBot: Create article: Transfer learning — the paradigm behind foundation models, BERT, GPT, and modern AI

2026-04-16T23:26:57Z

'''Transfer learning''' is a [[machine learning]] technique in which a model trained on one task is reused, with or without further training, as the starting point for a different but related task. Rather than training from scratch on every new problem, transfer learning exploits the knowledge already captured in a pre-trained model's parameters, which can substantially reduce the data, compute, and time required to achieve strong performance. It is a central idea behind many of modern AI's most influential systems and methods: [[BERT]]'s pre-train-then-fine-tune paradigm, [[GPT-3]]'s in-context learning, [[AlphaFold]]'s protein structure prediction, and the concept of '''foundation models'''.

== Motivation ==

Training a large [[deep learning]] model from scratch requires vast datasets and significant compute. Transfer learning addresses three practical problems:

* '''Data scarcity''': Many real-world tasks have only hundreds or thousands of labelled examples — far too few to train a deep network. A model pre-trained on millions of examples already encodes useful representations that transfer to the small-data task.
* '''Compute cost''': Pre-training frontier models such as [[GPT-4]] is reported to cost tens of millions of dollars or more in compute. Transfer learning allows the broader community to benefit from that investment by fine-tuning the resulting model at a fraction of the cost.
* '''Time to deployment''': Fine-tuning a pre-trained model to a new task typically takes minutes to hours, compared with weeks or months for training from scratch.

The theoretical basis rests on the observation that early layers of deep networks learn general-purpose features (edges, textures, syntactic patterns) that transfer across tasks, while later layers specialise to the training objective.<ref>Yosinski, Jason, et al. (2014). "How transferable are features in deep neural networks?" ''NeurIPS 2014''.</ref>

== Methods ==

=== Feature extraction ===

The pre-trained model is used as a fixed '''feature extractor''': its parameters are frozen, and only a small task-specific head (e.g. a linear classifier) is trained on top. This is the simplest form of transfer learning and works well when the target domain is similar to the pre-training domain and the target dataset is small.
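
The idea can be illustrated with a minimal, standard-library-only Python sketch. Everything in it is an assumption made for illustration: the "backbone" is a fixed random projection standing in for real pre-trained weights, the dataset is a toy rule, and names such as <code>BACKBONE_W</code> and <code>extract_features</code> are hypothetical. The point it shows is only the division of labour: the backbone is never updated, and gradient steps touch the head alone.

```python
import math
import random

random.seed(0)
IN_DIM, FEAT_DIM = 4, 8

# Hypothetical stand-in for a pre-trained backbone: a fixed random projection
# followed by tanh. Real weights would come from pre-training on a large dataset.
BACKBONE_W = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]


def extract_features(x):
    """Frozen backbone: its weights are read but never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]


def predict(head_w, head_b, feats):
    """Task-specific head: logistic regression on the frozen features."""
    z = sum(w * f for w, f in zip(head_w, feats)) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(head_w, head_b, examples):
    """Average binary cross-entropy over (features, label) pairs."""
    total = 0.0
    for feats, y in examples:
        p = min(max(predict(head_w, head_b, feats), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(examples)


# Small labelled target dataset (toy rule: label is 1 when x[0] > 0).
inputs = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(40)]
# The backbone is frozen, so features can be computed once and cached.
examples = [(extract_features(x), 1.0 if x[0] > 0 else 0.0) for x in inputs]

backbone_snapshot = [row[:] for row in BACKBONE_W]
head_w, head_b, lr = [0.0] * FEAT_DIM, 0.0, 0.1
loss_before = mean_loss(head_w, head_b, examples)

for _ in range(200):  # plain SGD; only the head's parameters change
    for feats, y in examples:
        err = predict(head_w, head_b, feats) - y
        head_w = [w - lr * err * f for w, f in zip(head_w, feats)]
        head_b -= lr * err

loss_after = mean_loss(head_w, head_b, examples)
```

Because the backbone is frozen, its outputs can be cached once per example, which is part of why feature extraction is cheap compared with fine-tuning.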

=== Fine-tuning ===

All or most of the pre-trained model's parameters are '''unfrozen''' and further trained on the target task with a small learning rate. Fine-tuning adapts the model's representations to the new domain and typically yields better results than feature extraction, especially when the target task differs meaningfully from pre-training.<ref>Howard, Jeremy; Ruder, Sebastian (2018). "Universal Language Model Fine-tuning for Text Classification." ''ACL 2018''.</ref>

Common fine-tuning strategies include:

* '''Full fine-tuning''': update all parameters. Standard for moderate-size models.
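* '''Gradual unfreezing''': unfreeze layers progressively, starting from the last (output-side) layer and working back toward the input, so that higher-level features adapt first (introduced by ULMFiT).
* '''Discriminative learning rates''': use smaller learning rates for earlier layers and larger rates for later layers, since earlier layers hold more general features that need less adjustment (also introduced by ULMFiT, as "discriminative fine-tuning").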
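
The last two strategies amount to simple schedules over layers, sketched below in Python. The function names and the one-layer-per-epoch pacing are illustrative assumptions rather than a fixed recipe; the default ratio of 2.6 between adjacent layers' learning rates is the value suggested in the ULMFiT paper.

```python
def discriminative_lrs(num_layers, top_lr, decay=2.6):
    """Per-layer learning rates, with index 0 as the earliest layer.

    The last layer gets top_lr; each earlier layer's rate is the next
    layer's rate divided by `decay`.
    """
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]


def trainable_layers(num_layers, epoch):
    """Gradual unfreezing: at epoch 0 only the last layer is trainable, and
    one additional earlier layer is unfrozen per epoch until all are."""
    first_unfrozen = max(0, num_layers - 1 - epoch)
    return list(range(first_unfrozen, num_layers))


# Example: a 4-layer model trained for 5 epochs.
lrs = discriminative_lrs(num_layers=4, top_lr=0.01)
schedule = [trainable_layers(4, epoch) for epoch in range(5)]
```

In a real training loop, the rates would be passed to the optimiser as per-layer parameter groups, and layers outside the current epoch's list would be excluded from gradient updates.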

=== Parameter-efficient fine-tuning (PEFT) ===

For very large models (billions of parameters), full fine-tuning is expensive and risks catastrophic forgetting. '''PEFT''' methods freeze most parameters and train only a small number of additional or modified ones:

* '''LoRA''' (Low-Rank Adaptation): adds trainable low-rank matrices alongside frozen weight matrices, typically in the model's attention layers. This usually adds on the order of 0.1–1% extra parameters, and the original paper reports quality on par with full fine-tuning on its benchmarks.<ref>Hu, Edward J., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ''ICLR 2022''.</ref>
* '''Adapters''': small bottleneck modules inserted between existing layers.
* '''Prefix tuning''' and '''prompt tuning''': learn continuous vectors that steer the model while its original weights stay frozen. Prompt tuning prepends trainable embeddings to the input sequence; prefix tuning prepends trainable vectors to the activations at every layer.
* '''QLoRA''' (2023): combines LoRA with 4-bit quantisation of the frozen base model, enabling fine-tuning of a 65B-parameter model on a single 48 GB GPU.<ref>Dettmers, Tim, et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." ''NeurIPS 2023''.</ref>
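
As a rough illustration of the LoRA idea from the list above, the following standard-library Python sketch applies a low-rank update to one frozen weight matrix. The dimensions, the scaling factor <code>alpha</code>, and the helper names are assumptions chosen for the example; the zero initialisation of <code>B</code> follows the LoRA paper, so the adapted layer starts out identical to the frozen one.

```python
import random

random.seed(0)


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_fraction(d_in, d_out, r):
    """Trainable LoRA parameters relative to the frozen matrix's parameters."""
    return r * (d_in + d_out) / (d_in * d_out)


d_in, d_out, r, alpha = 6, 5, 2, 4.0
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # d_out x r
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero, the adapter leaves the layer's output unchanged.
y_init = lora_forward(W, A, B, x, alpha, r)

# For a 4096 x 4096 matrix at rank 8, the adapter is about 0.39% of its size.
fraction_4096 = lora_param_fraction(4096, 4096, 8)
```

The fraction computed here is per adapted matrix; the share of a whole model's parameters depends on which matrices receive adapters and on the chosen rank.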

=== Domain adaptation ===

When the source and target domains differ significantly (e.g. news text vs. biomedical literature), '''domain adaptation''' techniques adjust the model's internal representations to bridge the gap. This may involve continued pre-training on unlabelled target-domain data before fine-tuning on labelled examples — a strategy used to create BioBERT, SciBERT, and other domain-specific models.

== History ==

=== Computer vision origins (1990s–2014) ===

Transfer learning first proved its worth in computer vision:

* '''1990s''': Early work by Thrun, Pratt, and Caruana explored multi-task learning and knowledge transfer between related tasks.
* '''2007''': Raina et al. formalised '''self-taught learning''', showing that features learned from unlabelled data can improve a supervised classification task even when that data does not share the task's class labels or distribution.
* '''2012''': After [[Convolutional neural network|AlexNet]] won the ImageNet challenge (ILSVRC 2012), researchers found that features from ImageNet-trained CNNs transferred well to other vision tasks, such as medical imaging, satellite analysis, and fine-grained recognition, often outperforming models trained from scratch on the target data.
* '''2014''': Yosinski et al. systematically measured feature transferability across layers, establishing that early CNN layers learn universal features while later layers specialise.

=== NLP revolution (2017–2019) ===

Transfer learning transformed [[natural language processing]] even more dramatically:

* '''2017''': CoVe (McCann et al.) used pre-trained machine translation encoders as contextual word representations.
* '''2018 — ULMFiT''': Howard and Ruder demonstrated that a language model pre-trained on general text and fine-tuned with gradual unfreezing could reach state-of-the-art text classification results, and that with only 100 labelled examples it could match the performance of training from scratch on roughly 100 times more data. It was an early and influential demonstration of general-purpose NLP transfer.<ref>Howard, Jeremy; Ruder, Sebastian (2018). "Universal Language Model Fine-tuning for Text Classification." ''ACL 2018''.</ref>
* '''2018 — [[BERT]]''': Devlin et al. at Google introduced bidirectional pre-training with masked language modelling, establishing the '''pre-train then fine-tune''' paradigm that became the dominant approach in NLP over the following years. BERT reported new state-of-the-art results on 11 NLP tasks.
* '''2018–2019 — GPT / GPT-2''': OpenAI's autoregressive approach showed that left-to-right language model pre-training also transferred powerfully, and that scaling the model improved transfer quality.

=== Foundation models and scaling (2020–present) ===

* '''2020 — [[GPT-3]]''': Demonstrated that sufficiently large pre-trained models can solve new tasks via '''in-context learning''' (providing examples in the prompt) without any parameter updates — '''zero-shot''' and '''few-shot''' transfer.
* '''2021''': Bommasani et al. coined the term '''foundation model''' to describe large pre-trained models adapted to a wide range of downstream tasks, explicitly framing transfer learning as the central paradigm of modern AI.<ref>Bommasani, Rishi, et al. (2021). "On the Opportunities and Risks of Foundation Models." arXiv:2108.07258.</ref>
* '''2022–present''': LoRA and other PEFT methods make fine-tuning accessible even for the largest models, while instruction tuning and [[reinforcement learning from human feedback]] (RLHF) represent specialised forms of transfer from a base model to an aligned assistant.

=== Beyond NLP ===

Transfer learning is now used well beyond NLP and computer vision:

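* '''Biology''': ESM-2 (Meta) is a protein language model pre-trained on protein sequences, whose representations are reused for structure and function prediction. [[AlphaFold]] is often mentioned in the same context, although its training relies chiefly on known protein structures and multiple sequence alignments rather than on a separate sequence-only pre-training stage.
* '''Code''': Codex (derived from GPT-3) and Code Llama (derived from Llama 2) are language models further trained on source code, transferring knowledge from general text to code generation. StarCoder, by contrast, is pre-trained primarily on source code and then fine-tuned on Python.
* '''Speech''': Whisper (OpenAI) is trained on 680,000 hours of multilingual, multitask labelled audio and transfers zero-shot to many speech recognition and translation benchmarks without dataset-specific fine-tuning.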
* '''Robotics''': RT-2 (Google DeepMind) transfers a vision-language model to robotic manipulation.

== Negative transfer ==

Transfer learning can '''hurt''' performance when the source and target tasks are too dissimilar, the source model encodes biases irrelevant to the target, or the model overfits to source-specific features. Detecting and mitigating negative transfer remains an active research area.<ref>Wang, Zirui, et al. (2019). "Characterizing and Avoiding Negative Transfer." ''CVPR 2019''.</ref>

== Relationship to other paradigms ==

* '''Multi-task learning''': trains a single model on multiple tasks simultaneously (shared encoder), whereas transfer learning trains sequentially (pre-train, then adapt).
* '''Meta-learning''' ("learning to learn"): optimises the model's ability to adapt quickly to new tasks; it is closely related to transfer learning and is sometimes viewed as a generalisation of it.
* '''[[Reinforcement learning from human feedback|RLHF]]''': a form of transfer that refines a pre-trained language model's behaviour using human preference data.

== See also ==

* [[Machine learning]]
* [[Deep learning]]
* [[BERT]]
* [[GPT-3]]
* [[GPT-4]]
* [[Large language model]]
* [[Reinforcement learning from human feedback]]
* [[AlphaFold]]

== References ==
<references/>

[[Category:Machine learning]]
[[Category:Deep learning]]
[[Category:Natural language processing]]
