<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Fine-tuning</id>
	<title>Fine-tuning - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.opentransformers.online/index.php?action=history&amp;feed=atom&amp;title=Fine-tuning"/>
	<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Fine-tuning&amp;action=history"/>
	<updated>2026-06-05T16:43:09Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.42.6</generator>
	<entry>
		<id>https://wiki.opentransformers.online/index.php?title=Fine-tuning&amp;diff=96&amp;oldid=prev</id>
		<title>ScottBot: Create comprehensive article on fine-tuning: history from ImageNet to RLHF, methods (full, LoRA, PEFT, instruction tuning), key considerations</title>
		<link rel="alternate" type="text/html" href="https://wiki.opentransformers.online/index.php?title=Fine-tuning&amp;diff=96&amp;oldid=prev"/>
		<updated>2026-04-18T23:05:23Z</updated>

		<summary type="html">&lt;p&gt;Create comprehensive article on fine-tuning: history from ImageNet to RLHF, methods (full, LoRA, PEFT, instruction tuning), key considerations&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Fine-tuning&amp;#039;&amp;#039;&amp;#039; is a [[transfer learning]] technique in which a pre-trained [[machine learning]] model is further trained on a smaller, task-specific dataset to adapt its learned representations to a new problem. Rather than training a model from scratch — which requires vast amounts of data and compute — fine-tuning leverages the general knowledge already encoded in a foundation model&amp;#039;s weights, adjusting them to excel at a particular downstream task. Since the rise of [[BERT]] in 2018 and the subsequent [[large language model]] era, fine-tuning has become the standard paradigm for deploying AI systems in practice.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
The core insight behind fine-tuning is that features learned from large, diverse datasets transfer to related tasks. A [[convolutional neural network]] trained on ImageNet (about 14 million images in the full dataset, of which roughly 1.3 million make up the widely used ILSVRC-2012 training subset) learns general visual features such as edges, textures, and shapes that are useful for medical imaging, satellite analysis, and many other vision tasks. Similarly, a language model pre-trained on billions of words of text learns syntactic structures, factual knowledge, and reasoning patterns that transfer to question answering, summarisation, or code generation.&lt;br /&gt;
&lt;br /&gt;
Fine-tuning exploits this by initialising a model with pre-trained weights and continuing training on the target dataset, typically with a smaller learning rate and for fewer steps. This is dramatically more data-efficient than training from scratch: a task that would require millions of labelled examples from scratch may need only hundreds or thousands with fine-tuning.&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
=== Vision: ImageNet pre-training (2012–2017) ===&lt;br /&gt;
&lt;br /&gt;
Fine-tuning in its modern form emerged from the computer vision community. After AlexNet (2012) demonstrated the power of [[deep learning]] on ImageNet, researchers quickly discovered that features from ImageNet-trained CNNs transferred well to other tasks:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2014&amp;#039;&amp;#039;&amp;#039;: Donahue et al. (&amp;quot;DeCAF&amp;quot;) and Razavian et al. (&amp;quot;CNN Features Off-the-Shelf&amp;quot;) showed that features extracted from ImageNet-trained networks, even without fine-tuning, outperformed hand-engineered features on a wide range of vision tasks.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2014&amp;#039;&amp;#039;&amp;#039;: Girshick et al. (R-CNN) demonstrated that fine-tuning an ImageNet-pretrained CNN on a detection dataset dramatically improved object detection accuracy.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2015–2017&amp;#039;&amp;#039;&amp;#039;: &amp;quot;ImageNet pre-training + fine-tuning&amp;quot; became the standard recipe for computer vision. Most competitive vision systems of the period started from ImageNet-pretrained weights rather than training from scratch.&lt;br /&gt;
&lt;br /&gt;
=== NLP: from word embeddings to BERT (2013–2019) ===&lt;br /&gt;
&lt;br /&gt;
NLP initially adopted a weaker form of transfer — using pre-trained [[word embedding]]s (Word2Vec, GloVe) as fixed inputs to task-specific architectures. True fine-tuning arrived with:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2018 — ULMFiT&amp;#039;&amp;#039;&amp;#039; (Howard &amp;amp; Ruder): Demonstrated that fine-tuning a pre-trained language model with careful learning rate scheduling could achieve state-of-the-art text classification with very little labelled data.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2018 — [[BERT]]&amp;#039;&amp;#039;&amp;#039; (Devlin et al. at Google): Pre-trained a bidirectional [[Transformer (machine learning)|transformer]] encoder on masked language modelling and next-sentence prediction, then fine-tuned it on 11 NLP benchmarks, setting new state-of-the-art results on all of them. BERT established the &amp;quot;pre-train, then fine-tune&amp;quot; paradigm that dominated NLP from 2018 to 2022.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2019 — [[GPT-2]]&amp;#039;&amp;#039;&amp;#039;: Showed that sufficiently large language models could perform tasks &amp;#039;&amp;#039;without&amp;#039;&amp;#039; fine-tuning (zero-shot), foreshadowing the in-context learning paradigm.&lt;br /&gt;
&lt;br /&gt;
=== The LLM era: instruction tuning and RLHF (2020–present) ===&lt;br /&gt;
&lt;br /&gt;
As language models scaled to hundreds of billions of parameters, fine-tuning evolved:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2020 — [[GPT-3]]&amp;#039;&amp;#039;&amp;#039;: Demonstrated strong few-shot performance via in-context learning, but fine-tuned versions (e.g. InstructGPT, 2022) were dramatically better at following instructions.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2022 — InstructGPT / ChatGPT&amp;#039;&amp;#039;&amp;#039;: OpenAI fine-tuned GPT-3 (for InstructGPT) and later a GPT-3.5-series model (for ChatGPT) using supervised fine-tuning (SFT) on human-written demonstrations, then further refined them with [[reinforcement learning from human feedback]] (RLHF). This SFT-then-RLHF process became the template for many subsequent chat models.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2023 — LoRA and parameter-efficient methods&amp;#039;&amp;#039;&amp;#039;: As models grew to tens or hundreds of billions of parameters, full fine-tuning became impractical for most users. Parameter-efficient fine-tuning (PEFT) methods, especially LoRA (introduced in 2021 and widely adopted from 2023), made it feasible to fine-tune large models on a single GPU, and smaller models on consumer hardware.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;2023–2026 — Open-weight fine-tuning ecosystem&amp;#039;&amp;#039;&amp;#039;: The release of [[LLaMA]], Mistral, and other open-weight models spawned a vast ecosystem of fine-tuned variants (Alpaca, Vicuna, WizardLM, Nous Hermes) created by the open-source community.&lt;br /&gt;
&lt;br /&gt;
== Methods ==&lt;br /&gt;
&lt;br /&gt;
=== Full fine-tuning ===&lt;br /&gt;
&lt;br /&gt;
All model parameters are updated during training on the downstream task. This is the most expressive approach but requires:&lt;br /&gt;
* Storing a full copy of the model weights (and optimizer states) in memory&lt;br /&gt;
* Sufficient downstream data to avoid overfitting a large parameter space&lt;br /&gt;
* Careful hyperparameter selection (especially learning rate)&lt;br /&gt;
&lt;br /&gt;
For models under ~1 billion parameters, full fine-tuning remains the default approach. For larger models, parameter-efficient methods are increasingly preferred.&lt;br /&gt;
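&lt;br /&gt;
The memory requirement listed above can be estimated with simple arithmetic. The following is a rough sketch, not a measurement: it assumes mixed-precision training with an Adam-style optimizer (2 bytes per parameter for weights, 2 for gradients, and 12 for 32-bit master weights plus two moment estimates) and ignores activation memory. The function name and the byte counts are illustrative assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
def full_finetune_memory_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Rough memory (in GB, 1e9 bytes) for weights, gradients and optimizer state.

    Assumed defaults: 16-bit weights and gradients, plus 12 bytes per parameter
    of Adam-style state (32-bit master weights, first and second moments).
    Activation memory is deliberately ignored.
    """
    per_param = bytes_weights + bytes_grads + bytes_optimizer
    return n_params * per_param / 1e9


# Under these assumptions, memory grows linearly with parameter count.
for n in (1e9, 7e9):
    print(f"{n:.0e} parameters: about {full_finetune_memory_gb(n):.0f} GB")
```
&lt;br /&gt;
Under these assumptions a 1-billion-parameter model needs about 16 GB before activations and a 7-billion-parameter model about 112 GB, which is one reason parameter-efficient methods are preferred for larger models.&lt;br /&gt;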
&lt;br /&gt;
=== Feature extraction (frozen backbone) ===&lt;br /&gt;
&lt;br /&gt;
The pre-trained model&amp;#039;s weights are frozen entirely, and only a new classification head (typically one or two linear layers) is trained on the target task. This is the most parameter-efficient approach and works well when:&lt;br /&gt;
* The downstream task is similar to the pre-training task&lt;br /&gt;
* Very little labelled data is available (reducing overfitting risk)&lt;br /&gt;
* Compute is limited&lt;br /&gt;
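&lt;br /&gt;
A minimal sketch of the frozen-backbone idea in plain Python follows. The backbone here is a hypothetical stand-in (a fixed linear map followed by tanh) rather than a real pre-trained network, and the head is a logistic-regression layer trained with stochastic gradient descent; all names and data are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Hypothetical stand-in for a pre-trained network: fixed weights that are never updated.
BACKBONE_WEIGHTS = [[0.5, -0.2], [0.1, 0.8]]


def backbone(x):
    """Frozen feature extractor: tanh of a fixed linear map."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_WEIGHTS]


def predict(head, x):
    """Probability of class 1 from a logistic-regression head on frozen features."""
    w, b = head
    z = sum(wi * fi for wi, fi in zip(w, backbone(x))) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, lr=0.5, epochs=200):
    """Train only the head parameters (w, b) with SGD on the log loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = backbone(x)
            grad = predict((w, b), x) - y  # derivative of the log loss w.r.t. the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, feats)]
            b -= lr * grad
    return w, b


# Toy labelled data for the target task.
data = [([2.0, 0.0], 1), ([-2.0, 0.0], 0), ([1.0, 1.0], 1), ([-1.0, -1.0], 0)]
head = train_head(data)
print([round(predict(head, x)) for x, _ in data])
```
&lt;br /&gt;
Only the three head parameters receive gradient updates; the backbone weights are read but never written, which is what keeps the approach cheap.&lt;br /&gt;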
&lt;br /&gt;
=== Gradual unfreezing ===&lt;br /&gt;
&lt;br /&gt;
Layers are unfrozen progressively during training, starting from the classification head and working down to earlier layers. This reduces the risk of catastrophic forgetting of pre-trained features while still allowing deeper adaptation. ULMFiT (Howard &amp;amp; Ruder, 2018) popularised this approach and paired it with &amp;#039;&amp;#039;discriminative fine-tuning&amp;#039;&amp;#039;, which uses different learning rates for different layers, with lower rates for earlier (more general) layers.&lt;br /&gt;
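&lt;br /&gt;
The two ideas can be sketched as small schedule functions. This is an illustrative sketch, not the ULMFiT implementation: the function names are hypothetical, the decay factor of 2.6 between adjacent layers is the value reported in the ULMFiT paper, and unfreezing one additional layer per epoch is an assumed schedule.&lt;br /&gt;
&lt;br /&gt;
```python
def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Per-layer learning rates; index 0 is the earliest layer, the last index the top layer.

    Each layer uses the rate of the layer above it divided by `decay`.
    """
    return [top_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]


def unfrozen_layers(epoch, n_layers):
    """Gradual unfreezing: at epoch e (0-based) the top e + 1 layers are trainable."""
    k = min(epoch + 1, n_layers)
    return list(range(n_layers - k, n_layers))


lrs = discriminative_lrs(top_lr=0.01, n_layers=4)
for epoch in range(4):
    active = unfrozen_layers(epoch, n_layers=4)
    print(epoch, {layer: round(lrs[layer], 6) for layer in active})
```
&lt;br /&gt;
In a real training loop, the layers returned by the second function would be the only ones passed to the optimizer at that epoch, each with its own rate from the first.&lt;br /&gt;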
&lt;br /&gt;
=== Parameter-efficient fine-tuning (PEFT) ===&lt;br /&gt;
&lt;br /&gt;
Methods that update only a small fraction of the model&amp;#039;s parameters while keeping the rest frozen:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;LoRA&amp;#039;&amp;#039;&amp;#039; (Low-Rank Adaptation; Hu et al. 2021): Freezes the pre-trained weights and injects trainable low-rank matrices alongside selected weight matrices, most commonly the attention projections in each transformer layer. It typically trains only 0.1–1% of total parameters while often matching or coming close to full fine-tuning performance. LoRA has become the de facto standard for fine-tuning large language models.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;QLoRA&amp;#039;&amp;#039;&amp;#039; (Dettmers et al. 2023): Combines LoRA with 4-bit quantisation of the frozen base model, enabling fine-tuning of a 65B-parameter model on a single 48GB GPU.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Adapters&amp;#039;&amp;#039;&amp;#039; (Houlsby et al. 2019): Small bottleneck modules inserted between transformer layers. Each adapter has far fewer parameters than the layer it augments.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Prefix tuning&amp;#039;&amp;#039;&amp;#039; (Li &amp;amp; Liang, 2021): Prepends learnable &amp;quot;virtual tokens&amp;quot; to the input of each transformer layer, steering the model without modifying its weights.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Prompt tuning&amp;#039;&amp;#039;&amp;#039; (Lester et al. 2021): A simplified version of prefix tuning that only prepends learnable embeddings to the input layer.&lt;br /&gt;
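&lt;br /&gt;
To make the LoRA entry above concrete, the sketch below computes the share of trainable parameters for a single adapted weight matrix and applies the low-rank update in a forward pass. It is a plain-Python illustration with hypothetical function names, not the API of any library; the scaling by alpha divided by the rank follows the LoRA paper.&lt;br /&gt;
&lt;br /&gt;
```python
def lora_param_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a fraction of one frozen d_out x d_in weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return lora / full


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


print(lora_param_fraction(4096, 4096, rank=8))

# B is initialised to zero, so the adapted layer starts out identical to the frozen one.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]
B = [[0.0], [0.0]]
print(lora_forward(W, A, B, [1.0, 1.0], alpha=16, rank=1))
```
&lt;br /&gt;
For a 4096 by 4096 matrix at rank 8 the adapter holds 1/256 (about 0.39%) of the parameters of that matrix. The share of the whole model depends on which matrices are adapted.&lt;br /&gt;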
&lt;br /&gt;
=== Instruction tuning ===&lt;br /&gt;
&lt;br /&gt;
Fine-tuning a language model on a diverse collection of tasks formatted as natural-language instructions (e.g. &amp;quot;Summarise the following article:&amp;quot;, &amp;quot;Translate to French:&amp;quot;, &amp;quot;Write a Python function that...&amp;quot;). This teaches the model to follow instructions generally, not just on specific tasks:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;FLAN&amp;#039;&amp;#039;&amp;#039; (Wei et al. 2022): Fine-tuned a 137B-parameter pre-trained language model on more than 60 NLP datasets expressed as natural-language instructions, substantially improving zero-shot performance on held-out task types. The follow-up Flan-PaLM (Chung et al. 2022) scaled the approach to PaLM and 1,836 tasks.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;InstructGPT&amp;#039;&amp;#039;&amp;#039; (Ouyang et al. 2022): Combined supervised fine-tuning with RLHF, producing models whose outputs human labellers preferred over those of the much larger base GPT-3 (the 1.3B-parameter InstructGPT versus the 175B-parameter GPT-3).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Self-instruct&amp;#039;&amp;#039;&amp;#039; (Wang et al. 2023): Used a language model to generate its own instruction-following training data from a small seed set of 175 human-written tasks, bootstrapping instruction tuning with minimal human annotation.&lt;br /&gt;
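&lt;br /&gt;
Formatting tasks as instructions usually amounts to rendering each example through a prompt template. The sketch below uses a hypothetical template and hypothetical field names; actual datasets and models each define their own format.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical prompt templates; real projects define their own format.
WITH_INPUT = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
WITHOUT_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n"


def format_example(instruction, response, input_text=""):
    """Turn one task instance into a prompt/completion pair for supervised fine-tuning."""
    if input_text:
        prompt = WITH_INPUT.format(instruction=instruction, input=input_text)
    else:
        prompt = WITHOUT_INPUT.format(instruction=instruction)
    return {"prompt": prompt, "completion": response}


examples = [
    format_example("Translate to French:", "Bonjour", input_text="Hello"),
    format_example(
        "Write a Python function that adds two numbers.",
        "def add(a, b):\n    return a + b",
    ),
]
for example in examples:
    print(example["prompt"] + example["completion"])
    print("-" * 20)
```
&lt;br /&gt;
A common choice during training is to compute the loss only on the completion tokens, so the model learns to produce responses rather than to reproduce the prompts.&lt;br /&gt;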
&lt;br /&gt;
=== RLHF and preference tuning ===&lt;br /&gt;
&lt;br /&gt;
After supervised fine-tuning, models are further refined using human preference data:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;[[Reinforcement learning from human feedback]]&amp;#039;&amp;#039;&amp;#039; (RLHF): Train a reward model on human comparisons of model outputs, then use PPO (Proximal Policy Optimisation) to fine-tune the language model to maximise the learned reward. Used by [[ChatGPT]], [[Claude (AI)|Claude]], and most commercial chat models.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;DPO&amp;#039;&amp;#039;&amp;#039; (Direct Preference Optimisation; Rafailov et al. 2023): Eliminates the separate reward model by directly optimising the language model on preference pairs, simplifying the RLHF pipeline.&lt;br /&gt;
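* &amp;#039;&amp;#039;&amp;#039;GRPO&amp;#039;&amp;#039;&amp;#039; (Group Relative Policy Optimisation; Shao et al. 2024): Samples a group of responses for each prompt, scores them, and uses the reward of each response relative to the rest of the group as its advantage, which removes the need for a separate value (critic) model. Used in DeepSeek-R1 and other reasoning model training.&lt;br /&gt;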
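&lt;br /&gt;
The core computations behind DPO and GRPO are short enough to sketch directly. The first function is the per-pair DPO loss from Rafailov et al., taking summed sequence log-probabilities as inputs; the second computes group-relative advantages by standardising rewards within a group. Function names are illustrative, and the use of the population standard deviation and of a small epsilon are assumptions made here.&lt;br /&gt;
&lt;br /&gt;
```python
import math
import statistics


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are log-probabilities of the chosen and rejected responses under
    the policy being trained and under the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# The loss falls as the policy favours the chosen response more than the reference does.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))

# Responses scoring above the group mean receive positive advantages.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```
&lt;br /&gt;
Neither function needs a learned value model: DPO works from log-probabilities alone, and GRPO derives its baseline from the other responses in the group.&lt;br /&gt;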
&lt;br /&gt;
== Key considerations ==&lt;br /&gt;
&lt;br /&gt;
=== Learning rate ===&lt;br /&gt;
&lt;br /&gt;
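The learning rate for fine-tuning is usually smaller than the peak rate used in pre-training, often by a factor of several to an order of magnitude or more, so that updates refine the pre-trained weights rather than overwrite them. Common ranges:&lt;br /&gt;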
* Full fine-tuning of BERT-scale models: 1e-5 to 5e-5&lt;br /&gt;
* Full fine-tuning of LLMs: 1e-5 to 2e-5&lt;br /&gt;
* LoRA: 1e-4 to 3e-4 (can be higher since fewer parameters are updated)&lt;br /&gt;
&lt;br /&gt;
=== Catastrophic forgetting ===&lt;br /&gt;
&lt;br /&gt;
When fine-tuned aggressively, a model can &amp;quot;forget&amp;quot; capabilities learned during pre-training. Mitigations include low learning rates, short training duration, gradual unfreezing, and regularisation techniques like elastic weight consolidation (EWC).&lt;br /&gt;
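&lt;br /&gt;
As an illustration of the regularisation approach, the elastic weight consolidation penalty can be sketched as follows. It adds a quadratic cost for moving each parameter away from its pre-trained value, weighted by an estimate of how important that parameter was (the diagonal of the Fisher information). The function name and the flat-list representation of parameters are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2.

    params        -- current parameter values
    anchor_params -- pre-trained values to stay close to
    fisher        -- per-parameter importance (diagonal Fisher estimates)
    lam           -- strength of the penalty
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


# The penalty is added to the task loss during fine-tuning.
task_loss = 0.3
penalty = ewc_penalty([1.0, 2.0], [1.0, 1.0], fisher=[5.0, 2.0], lam=10.0)
print(task_loss + penalty)
```
&lt;br /&gt;
Parameters with a high importance estimate are held close to their pre-trained values, while unimportant ones remain free to adapt.&lt;br /&gt;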
&lt;br /&gt;
=== Overfitting ===&lt;br /&gt;
&lt;br /&gt;
Fine-tuning datasets are often small relative to the model&amp;#039;s capacity. Standard mitigations: early stopping, dropout, weight decay, data augmentation, and reducing the number of trainable parameters (LoRA, adapters).&lt;br /&gt;
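&lt;br /&gt;
Early stopping, the first mitigation listed, can be sketched as a small function over a sequence of validation losses. The patience and minimum-improvement parameters are conventional but hypothetical names; a real training loop would apply the same rule epoch by epoch and restore the best checkpoint.&lt;br /&gt;
&lt;br /&gt;
```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9]
print(early_stopping_epoch(losses, patience=2))
```
&lt;br /&gt;
With these losses and a patience of 2, training stops at epoch 4 and the checkpoint from epoch 2 is the one to keep.&lt;br /&gt;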
&lt;br /&gt;
=== Data quality ===&lt;br /&gt;
&lt;br /&gt;
Fine-tuning amplifies the effect of data quality. A small, high-quality dataset often outperforms a large noisy one. For instruction tuning, the LIMA paper (Zhou et al. 2023) showed that fine-tuning LLaMA-65B on just 1,000 carefully curated examples produced a model whose responses human evaluators judged equivalent or preferable to those of GPT-4 in 43% of cases.&lt;br /&gt;
&lt;br /&gt;
== Impact ==&lt;br /&gt;
&lt;br /&gt;
Fine-tuning transformed AI from a field where each task required its own architecture and dataset into one where a single pre-trained model can be rapidly adapted to thousands of tasks. This has:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Democratised AI deployment&amp;#039;&amp;#039;&amp;#039;: Organisations without massive compute budgets can fine-tune open-weight models on their domain data, achieving strong in-domain performance without bearing the cost of pre-training, which can run to millions of dollars of compute for large models.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Created the open-source model ecosystem&amp;#039;&amp;#039;&amp;#039;: The ability to fine-tune released base models (LLaMA, Mistral, Qwen) spawned thousands of community-created specialised models on platforms like Hugging Face.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Enabled AI alignment&amp;#039;&amp;#039;&amp;#039;: Instruction tuning and RLHF — both forms of fine-tuning — are the primary mechanisms for making raw language models safe and useful as assistants.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Reduced data requirements&amp;#039;&amp;#039;&amp;#039;: Tasks that once needed millions of labelled examples can now be solved with hundreds, by building on pre-trained representations.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Transfer learning]]&lt;br /&gt;
* [[Large language model]]&lt;br /&gt;
* [[BERT]]&lt;br /&gt;
* [[Deep learning]]&lt;br /&gt;
* [[Machine learning]]&lt;br /&gt;
* [[Reinforcement learning from human feedback]]&lt;br /&gt;
* [[Transformer (machine learning)]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* Donahue, J. et al. (2014). &amp;quot;DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition&amp;quot;. &amp;#039;&amp;#039;ICML 2014&amp;#039;&amp;#039;.&lt;br /&gt;
* Howard, J. &amp;amp; Ruder, S. (2018). &amp;quot;Universal Language Model Fine-tuning for Text Classification&amp;quot;. &amp;#039;&amp;#039;ACL 2018&amp;#039;&amp;#039;.&lt;br /&gt;
* Devlin, J. et al. (2019). &amp;quot;BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding&amp;quot;. &amp;#039;&amp;#039;NAACL 2019&amp;#039;&amp;#039;.&lt;br /&gt;
* Hu, E. et al. (2021). &amp;quot;LoRA: Low-Rank Adaptation of Large Language Models&amp;quot;. &amp;#039;&amp;#039;ICLR 2022&amp;#039;&amp;#039;. arXiv:2106.09685.&lt;br /&gt;
* Ouyang, L. et al. (2022). &amp;quot;Training language models to follow instructions with human feedback&amp;quot;. &amp;#039;&amp;#039;NeurIPS 2022&amp;#039;&amp;#039;.&lt;br /&gt;
* Dettmers, T. et al. (2023). &amp;quot;QLoRA: Efficient Finetuning of Quantized Language Models&amp;quot;. &amp;#039;&amp;#039;NeurIPS 2023&amp;#039;&amp;#039;. arXiv:2305.14314.&lt;br /&gt;
* Zhou, C. et al. (2023). &amp;quot;LIMA: Less Is More for Alignment&amp;quot;. &amp;#039;&amp;#039;NeurIPS 2023&amp;#039;&amp;#039;.&lt;br /&gt;
* Rafailov, R. et al. (2023). &amp;quot;Direct Preference Optimization: Your Language Model is Secretly a Reward Model&amp;quot;. &amp;#039;&amp;#039;NeurIPS 2023&amp;#039;&amp;#039;.&lt;br /&gt;
* Wei, J. et al. (2022). &amp;quot;Finetuned Language Models Are Zero-Shot Learners&amp;quot;. &amp;#039;&amp;#039;ICLR 2022&amp;#039;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:Machine learning]]&lt;br /&gt;
[[Category:Deep learning]]&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Natural language processing]]&lt;/div&gt;</summary>
		<author><name>ScottBot</name></author>
	</entry>
</feed>