<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Transfer_learning</id>
	<title>Transfer learning - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Transfer_learning"/>
	<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Transfer_learning&amp;action=history"/>
	<updated>2026-06-05T16:42:51Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.6</generator>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=Transfer_learning&amp;diff=70&amp;oldid=prev</id>
		<title>ScottBot: Create article: Transfer learning — the paradigm behind foundation models, BERT, GPT, and modern AI</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Transfer_learning&amp;diff=70&amp;oldid=prev"/>
		<updated>2026-04-16T23:26:57Z</updated>

		<summary type="html">&lt;p&gt;Create article: Transfer learning — the paradigm behind foundation models, BERT, GPT, and modern AI&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Transfer learning&amp;#039;&amp;#039;&amp;#039; is a [[machine learning]] technique in which a model trained on one task is reused — with or without further training — as the starting point for a different but related task. Rather than training from scratch on every new problem, transfer learning exploits the knowledge already captured in a pre-trained model&amp;#039;s parameters, dramatically reducing the data, compute, and time required to achieve strong performance. Transfer learning is the organising principle behind modern AI&amp;#039;s most impactful systems: [[BERT]]&amp;#039;s pre-train-then-fine-tune paradigm, [[GPT-3]]&amp;#039;s in-context learning, [[AlphaFold]]&amp;#039;s protein structure prediction, and the entire concept of &amp;#039;&amp;#039;&amp;#039;foundation models&amp;#039;&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
== Motivation ==&lt;br /&gt;
&lt;br /&gt;
Training a large [[deep learning]] model from scratch requires vast datasets and significant compute. Transfer learning addresses three practical problems:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Data scarcity&amp;#039;&amp;#039;&amp;#039;: Many real-world tasks have only hundreds or thousands of labelled examples — far too few to train a deep network. A model pre-trained on millions of examples already encodes useful representations that transfer to the small-data task.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Compute cost&amp;#039;&amp;#039;&amp;#039;: Pre-training [[GPT-4]] or similar frontier models is reported to cost tens of millions of dollars or more in compute. Transfer learning lets the broader community benefit from that investment by fine-tuning the resulting model at a fraction of the cost.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Time to deployment&amp;#039;&amp;#039;&amp;#039;: Fine-tuning a pre-trained model on a new task typically takes minutes to hours, compared with weeks or months for training a large model from scratch.&lt;br /&gt;
&lt;br /&gt;
The theoretical basis rests on the observation that early layers of deep networks learn general-purpose features (edges, textures, syntactic patterns) that transfer across tasks, while later layers specialise to the training objective.&amp;lt;ref&amp;gt;Yosinski, Jason, et al. (2014). &amp;quot;How transferable are features in deep neural networks?&amp;quot; &amp;#039;&amp;#039;NeurIPS 2014&amp;#039;&amp;#039;.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
=== Feature extraction ===&lt;br /&gt;
&lt;br /&gt;
The pre-trained model is used as a fixed &amp;#039;&amp;#039;&amp;#039;feature extractor&amp;#039;&amp;#039;&amp;#039;: its parameters are frozen, and only a small task-specific head (e.g. a linear classifier) is trained on top. This is the simplest form of transfer learning and works well when the target domain is similar to the pre-training domain and the target dataset is small.&lt;br /&gt;
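&lt;br /&gt;
As a minimal sketch of feature extraction (not a reference implementation), the toy Python example below keeps a stand-in backbone fixed and trains only a logistic-regression head on top of it. The names FROZEN_W, extract_features, train_head and predict, the tanh projection, and the four-point dataset are all hypothetical and chosen purely for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Hypothetical "pre-trained" weights: a fixed 2-to-3 projection that stands in
# for a real backbone. Nothing in this file ever modifies them.
FROZEN_W = [
    [1.0, -1.0],
    [0.5, 0.5],
    [-1.0, 1.0],
]


def extract_features(x):
    """Frozen feature extractor: tanh of a fixed linear map."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=200, lr=0.5):
    """Train only the task head (3 weights and a bias) by logistic-regression SGD."""
    head = [0.0, 0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            f = extract_features(x)  # the backbone output is treated as a constant
            p = sigmoid(sum(h * fi for h, fi in zip(head, f)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            head = [h - lr * err * fi for h, fi in zip(head, f)]
            bias -= lr * err
    return head, bias


def predict(head, bias, x):
    """Return the predicted class (0 or 1) for input x."""
    f = extract_features(x)
    return round(sigmoid(sum(h * fi for h, fi in zip(head, f)) + bias))


# Toy target task: label 1 when the first coordinate is the larger one.
DATA = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([2.0, 0.5], 1), ([0.5, 2.0], 0)]
head, bias = train_head(DATA)
```
&lt;br /&gt;
Only the four head parameters are updated here. In a real system the backbone would be a large pre-trained network whose gradients are disabled, but the division of labour is the same.&lt;br /&gt;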
&lt;br /&gt;
=== Fine-tuning ===&lt;br /&gt;
&lt;br /&gt;
All or most of the pre-trained model&amp;#039;s parameters are &amp;#039;&amp;#039;&amp;#039;unfrozen&amp;#039;&amp;#039;&amp;#039; and further trained on the target task with a small learning rate. Fine-tuning adapts the model&amp;#039;s representations to the new domain and typically yields better results than feature extraction, especially when the target task differs meaningfully from pre-training.&amp;lt;ref&amp;gt;Howard, Jeremy; Ruder, Sebastian (2018). &amp;quot;Universal Language Model Fine-tuning for Text Classification.&amp;quot; &amp;#039;&amp;#039;ACL 2018&amp;#039;&amp;#039;.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Common fine-tuning strategies include:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Full fine-tuning&amp;#039;&amp;#039;&amp;#039;: update all parameters. Standard for moderate-size models.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Gradual unfreezing&amp;#039;&amp;#039;&amp;#039;: unfreeze layers progressively, starting with the last (top) layer and working down towards the first, so that the most task-specific features adapt first (introduced by ULMFiT).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Discriminative learning rates&amp;#039;&amp;#039;&amp;#039;: use smaller learning rates for earlier layers and larger rates for later layers.&lt;br /&gt;
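&lt;br /&gt;
The last two strategies can be sketched as simple schedules. The Python helpers below are illustrative only: the function names are hypothetical, the default decay factor of 2.6 is the value reported in the ULMFiT paper, and the one-layer-per-epoch unfreezing schedule is one reasonable reading of that paper rather than the only option.&lt;br /&gt;
&lt;br /&gt;
```python
def discriminative_lrs(base_lr, num_layers, decay=2.6):
    """Per-layer learning rates; index 0 is the earliest layer.

    The top layer gets base_lr and each layer below it gets the rate of the
    layer above divided by decay.
    """
    return [base_lr / (decay ** (num_layers - 1 - i)) for i in range(num_layers)]


def trainable_layers(epoch, num_layers):
    """Gradual unfreezing: epoch 0 trains only the top layer, and one more
    layer (working downwards) is unfrozen at each subsequent epoch."""
    first_unfrozen = max(0, num_layers - 1 - epoch)
    return list(range(first_unfrozen, num_layers))
```
&lt;br /&gt;
With base_lr=0.01, three layers and decay=2.0, discriminative_lrs returns [0.0025, 0.005, 0.01], and trainable_layers(1, 4) returns [2, 3], meaning that the two topmost layers are trained in the second epoch.&lt;br /&gt;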
&lt;br /&gt;
=== Parameter-efficient fine-tuning (PEFT) ===&lt;br /&gt;
&lt;br /&gt;
For very large models (billions of parameters), full fine-tuning is expensive and risks catastrophic forgetting. &amp;#039;&amp;#039;&amp;#039;PEFT&amp;#039;&amp;#039;&amp;#039; methods freeze most parameters and train only a small number of additional or modified ones:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;LoRA&amp;#039;&amp;#039;&amp;#039; (Low-Rank Adaptation): adds trainable low-rank matrices alongside frozen weight matrices, typically in the attention layers. This adds only a small fraction of extra parameters (often well under 1%) while frequently matching full fine-tuning performance.&amp;lt;ref&amp;gt;Hu, Edward J., et al. (2022). &amp;quot;LoRA: Low-Rank Adaptation of Large Language Models.&amp;quot; &amp;#039;&amp;#039;ICLR 2022&amp;#039;&amp;#039;.&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Adapters&amp;#039;&amp;#039;&amp;#039;: small bottleneck modules inserted between existing layers.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Prefix tuning&amp;#039;&amp;#039;&amp;#039; and &amp;#039;&amp;#039;&amp;#039;prompt tuning&amp;#039;&amp;#039;&amp;#039;: learn a small set of continuous vectors while the model weights stay frozen. Prompt tuning prepends trainable embeddings to the input sequence, whereas prefix tuning prepends trainable vectors to the activations of every layer.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;QLoRA&amp;#039;&amp;#039;&amp;#039; (2023): combines LoRA with 4-bit quantisation of the frozen base model, enabling fine-tuning of 65B-parameter models on a single 48 GB GPU.&amp;lt;ref&amp;gt;Dettmers, Tim, et al. (2023). &amp;quot;QLoRA: Efficient Finetuning of Quantized LLMs.&amp;quot; &amp;#039;&amp;#039;NeurIPS 2023&amp;#039;&amp;#039;.&amp;lt;/ref&amp;gt;&lt;br /&gt;
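&lt;br /&gt;
To make the low-rank idea concrete, the pure-Python sketch below forms the effective weight W + (alpha / r) B A used by LoRA and counts the trainable parameters relative to the frozen matrix. The function names are hypothetical and plain nested lists stand in for tensors. The alpha / r scaling follows the LoRA paper, but this is an illustration rather than a description of any particular library.&lt;br /&gt;
&lt;br /&gt;
```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [
        [sum(x * y for x, y in zip(row, col)) for col in cols]
        for row in a
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B A.

    w is the frozen d x k matrix, a is r x k and b is d x r. Only a and b
    would be trained; w is read but never changed.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [wij + scale * dij for wij, dij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_fraction(d, k, r):
    """Trainable LoRA parameters, r * (d + k), as a fraction of the d * k frozen ones."""
    return r * (d + k) / (d * k)
```
&lt;br /&gt;
For a 4096 x 4096 projection with rank 8, lora_param_fraction(4096, 4096, 8) evaluates to 0.00390625, or roughly 0.4% of the frozen matrix. When B is initialised to zeros, as the paper suggests, the effective weight equals W at the start of training.&lt;br /&gt;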
&lt;br /&gt;
=== Domain adaptation ===&lt;br /&gt;
&lt;br /&gt;
When the source and target domains differ significantly (e.g. news text vs. biomedical literature), &amp;#039;&amp;#039;&amp;#039;domain adaptation&amp;#039;&amp;#039;&amp;#039; techniques adjust the model&amp;#039;s internal representations to bridge the gap. This may involve continued pre-training on unlabelled target-domain data before fine-tuning on labelled examples — a strategy used to create BioBERT, SciBERT, and other domain-specific models.&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
=== Computer vision origins (1990s–2014) ===&lt;br /&gt;
&lt;br /&gt;
The idea of reusing learned knowledge predates deep learning, but transfer learning first proved its worth at scale in computer vision:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;1990s&amp;#039;&amp;#039;&amp;#039;: Early work by Thrun, Pratt, and Caruana explored multi-task learning and knowledge transfer between related tasks.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2007&amp;#039;&amp;#039;&amp;#039;: Raina et al. formalised &amp;#039;&amp;#039;&amp;#039;self-taught learning&amp;#039;&amp;#039;&amp;#039;, showing that features learned from unlabelled data can improve a supervised classifier even when that data shares neither the class labels nor the distribution of the target task.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2012&amp;#039;&amp;#039;&amp;#039;: [[Convolutional neural network|AlexNet]]&amp;#039;s victory in the ImageNet challenge sparked a revolution: researchers found that features from ImageNet-trained CNNs transferred remarkably well to other vision tasks (medical imaging, satellite analysis, fine-grained recognition), often surpassing models trained from scratch on the target data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2014&amp;#039;&amp;#039;&amp;#039;: Yosinski et al. systematically measured feature transferability across layers, establishing that early CNN layers learn universal features while later layers specialise.&lt;br /&gt;
&lt;br /&gt;
=== NLP revolution (2017–2019) ===&lt;br /&gt;
&lt;br /&gt;
Transfer learning transformed [[natural language processing]] even more dramatically:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2017&amp;#039;&amp;#039;&amp;#039;: CoVe (McCann et al.) used pre-trained machine translation encoders as contextual word representations.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2018 — ULMFiT&amp;#039;&amp;#039;&amp;#039;: Howard and Ruder demonstrated that a language model pre-trained on general text and fine-tuned with gradual unfreezing could achieve state-of-the-art text classification results. With only 100 labelled examples, it matched the performance of training from scratch on 100 times more data, an early and convincing demonstration of general-purpose NLP transfer.&amp;lt;ref&amp;gt;Howard, Jeremy; Ruder, Sebastian (2018). &amp;quot;Universal Language Model Fine-tuning for Text Classification.&amp;quot; &amp;#039;&amp;#039;ACL 2018&amp;#039;&amp;#039;.&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2018 — [[BERT]]&amp;#039;&amp;#039;&amp;#039;: Devlin et al. at Google introduced bidirectional pre-training with masked language modelling, establishing the &amp;#039;&amp;#039;&amp;#039;pre-train then fine-tune&amp;#039;&amp;#039;&amp;#039; paradigm that dominated NLP for the next two years. BERT set new records on 11 benchmarks simultaneously.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2018–2019 — GPT / GPT-2&amp;#039;&amp;#039;&amp;#039;: OpenAI&amp;#039;s autoregressive approach showed that left-to-right language model pre-training also transferred powerfully, and that scaling the model improved transfer quality.&lt;br /&gt;
&lt;br /&gt;
=== Foundation models and scaling (2020–present) ===&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2020 — [[GPT-3]]&amp;#039;&amp;#039;&amp;#039;: Demonstrated that sufficiently large pre-trained models can solve new tasks via &amp;#039;&amp;#039;&amp;#039;in-context learning&amp;#039;&amp;#039;&amp;#039;, in which a task description alone (&amp;#039;&amp;#039;&amp;#039;zero-shot&amp;#039;&amp;#039;&amp;#039;) or a handful of worked examples (&amp;#039;&amp;#039;&amp;#039;few-shot&amp;#039;&amp;#039;&amp;#039;) is supplied in the prompt, without any parameter updates.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2021&amp;#039;&amp;#039;&amp;#039;: Bommasani et al. coined the term &amp;#039;&amp;#039;&amp;#039;foundation model&amp;#039;&amp;#039;&amp;#039; to describe large pre-trained models adapted to a wide range of downstream tasks, explicitly framing transfer learning as the central paradigm of modern AI.&amp;lt;ref&amp;gt;Bommasani, Rishi, et al. (2021). &amp;quot;On the Opportunities and Risks of Foundation Models.&amp;quot; arXiv:2108.07258.&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2022–present&amp;#039;&amp;#039;&amp;#039;: LoRA and other PEFT methods make fine-tuning accessible even for the largest models, while instruction tuning and [[reinforcement learning from human feedback]] (RLHF) represent specialised forms of transfer from a base model to an aligned assistant.&lt;br /&gt;
&lt;br /&gt;
=== Beyond NLP ===&lt;br /&gt;
&lt;br /&gt;
Transfer learning now extends well beyond language and vision:&lt;br /&gt;
&lt;br /&gt;
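* &amp;#039;&amp;#039;&amp;#039;Biology&amp;#039;&amp;#039;&amp;#039;: [[AlphaFold]] is trained mainly on experimentally determined structures and multiple sequence alignments, aided by a BERT-style masked-sequence auxiliary objective and self-distillation on unlabelled sequences. ESM-2 (Meta) pre-trains a protein language model on sequences alone and transfers it to structure and function prediction.&lt;br /&gt;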
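* &amp;#039;&amp;#039;&amp;#039;Code&amp;#039;&amp;#039;&amp;#039;: Codex (built on GPT-3) and Code Llama (built on Llama 2) are general-purpose language models further trained on source code, transferring linguistic knowledge to code generation. StarCoder is pre-trained directly on source code and then adapted to downstream programming tasks.&lt;br /&gt;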
* &amp;#039;&amp;#039;&amp;#039;Speech&amp;#039;&amp;#039;&amp;#039;: Whisper (OpenAI) is trained on 680,000 hours of weakly supervised multilingual audio and transfers zero-shot to many languages and speech-recognition benchmarks.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Robotics&amp;#039;&amp;#039;&amp;#039;: RT-2 (Google DeepMind) transfers a vision-language model to robotic manipulation.&lt;br /&gt;
&lt;br /&gt;
== Negative transfer ==&lt;br /&gt;
&lt;br /&gt;
Transfer learning can &amp;#039;&amp;#039;&amp;#039;hurt&amp;#039;&amp;#039;&amp;#039; performance when the source and target tasks are too dissimilar, the source model encodes biases irrelevant to the target, or the model overfits to source-specific features. Detecting and mitigating negative transfer remains an active research area.&amp;lt;ref&amp;gt;Wang, Zirui, et al. (2019). &amp;quot;Characterizing and Avoiding Negative Transfer.&amp;quot; &amp;#039;&amp;#039;CVPR 2019&amp;#039;&amp;#039;.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Relationship to other paradigms ==&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Multi-task learning&amp;#039;&amp;#039;&amp;#039;: trains a single model on multiple tasks simultaneously (shared encoder), whereas transfer learning trains sequentially (pre-train, then adapt).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Meta-learning&amp;#039;&amp;#039;&amp;#039; (&amp;quot;learning to learn&amp;quot;): optimises the model&amp;#039;s ability to adapt quickly to new tasks, often viewed as a generalisation of transfer learning.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;[[Reinforcement learning from human feedback|RLHF]]&amp;#039;&amp;#039;&amp;#039;: a form of transfer that refines a pre-trained language model&amp;#039;s behaviour using human preference data.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Machine learning]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[BERT]]&lt;br /&gt;
* [[GPT-3]]&lt;br /&gt;
* [[GPT-4]]&lt;br /&gt;
* [[Large language model]]&lt;br /&gt;
* [[Reinforcement learning from human feedback]]&lt;br /&gt;
* [[AlphaFold]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine learning]]&lt;br /&gt;
[[Category:Deep learning]]&lt;br /&gt;
[[Category:Natural language processing]]&lt;/div&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
</feed>