Fine-tuning

Fine-tuning is a transfer learning technique in which a pre-trained machine learning model is further trained on a smaller, task-specific dataset to adapt its learned representations to a new problem. Rather than training a model from scratch — which requires vast amounts of data and compute — fine-tuning leverages the general knowledge already encoded in a foundation model's weights, adjusting them to excel at a particular downstream task. Since the rise of BERT in 2018 and the subsequent large language model era, fine-tuning has become the standard paradigm for deploying AI systems in practice.

Overview

The core insight behind fine-tuning is that features learned from large, diverse datasets transfer to related tasks. A convolutional neural network trained on ImageNet (about 1.28 million training images in the widely used ILSVRC subset, drawn from roughly 14 million in the full database) learns general visual features such as edges, textures, and shapes that are useful for medical imaging, satellite analysis, and many other vision tasks. Similarly, a language model pre-trained on billions of words of text learns syntactic structures, factual knowledge, and reasoning patterns that transfer to question answering, summarisation, or code generation.

Fine-tuning exploits this by initialising a model with pre-trained weights and continuing training on the target dataset, typically with a smaller learning rate and for fewer steps. This is dramatically more data-efficient than training from scratch: a task that would require millions of labelled examples from scratch may need only hundreds or thousands with fine-tuning.

History

Vision: ImageNet pre-training (2012–2017)

Fine-tuning in its modern form emerged from the computer vision community. After AlexNet (2012) demonstrated the power of deep learning on ImageNet, researchers quickly discovered that features from ImageNet-trained CNNs transferred well to other tasks:

2014: Donahue et al. ("DeCAF") and Razavian et al. ("CNN Features Off-the-Shelf") showed that features extracted from ImageNet-trained networks, even without fine-tuning, outperformed hand-engineered features on a wide range of vision tasks.
2014: Girshick et al. (R-CNN) demonstrated that fine-tuning an ImageNet-pretrained CNN on a detection dataset dramatically improved object detection accuracy.
2015–2017: "ImageNet pre-training + fine-tuning" became the universal recipe for computer vision. Virtually no serious vision system was trained from scratch.

NLP: from word embeddings to BERT (2013–2019)

NLP initially adopted a weaker form of transfer — using pre-trained word embeddings (Word2Vec, GloVe) as fixed inputs to task-specific architectures. True fine-tuning arrived with:

2018: ULMFiT (Howard & Ruder) demonstrated that fine-tuning a pre-trained language model with careful learning rate scheduling could achieve state-of-the-art text classification with very little labelled data.
2018: BERT (Devlin et al. at Google) pre-trained a bidirectional transformer encoder on masked language modelling and next-sentence prediction, then fine-tuned it on 11 NLP tasks, setting new state-of-the-art results on all of them. BERT established the "pre-train, then fine-tune" paradigm that dominated NLP from 2018 to 2022.
2019: GPT-2 showed that sufficiently large language models could perform some tasks without fine-tuning (zero-shot), foreshadowing the in-context learning paradigm.

The LLM era: instruction tuning and RLHF (2020–present)

As language models scaled to hundreds of billions of parameters, fine-tuning evolved:

2020: GPT-3 demonstrated strong few-shot performance via in-context learning, but later fine-tuned versions (e.g. InstructGPT, 2022) were markedly better at following instructions.
2022: InstructGPT and ChatGPT. OpenAI applied supervised fine-tuning (SFT) on human-written demonstrations to GPT-3 (for InstructGPT) and to a GPT-3.5-series model (for ChatGPT), then refined the result with reinforcement learning from human feedback (RLHF). This SFT-then-RLHF recipe became the template for many subsequent chat models.
2021–2023: LoRA and parameter-efficient methods. As models grew to tens or hundreds of billions of parameters, full fine-tuning became impractical for most users. Parameter-efficient fine-tuning (PEFT) methods, especially LoRA (2021) and QLoRA (2023), made it feasible to fine-tune large models on a single GPU, and smaller ones on consumer hardware.
2023–2026: Open-weight fine-tuning ecosystem. The release of LLaMA, Mistral, and other open-weight models spawned a vast ecosystem of fine-tuned variants (Alpaca, Vicuna, WizardLM, Nous Hermes) created by the open-source community.

Methods

Full fine-tuning

All model parameters are updated during training on the downstream task. This is the most expressive approach but requires:

Storing a full copy of the model weights, their gradients, and the optimiser states in memory
Sufficient downstream data to avoid overfitting a large parameter space
Careful hyperparameter selection (especially learning rate)

For models under ~1 billion parameters, full fine-tuning remains the default approach. For larger models, parameter-efficient methods are increasingly preferred.
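To make the memory requirement concrete, the following is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam at about 16 bytes per parameter (2 for half-precision weights, 2 for gradients, and 12 for a full-precision master copy plus two Adam moment estimates) and ignores activations. The byte counts are a common rule of thumb rather than measurements, and the function name is illustrative.

```python
# Rough memory estimate for full fine-tuning with Adam in mixed precision.
# Assumption (rule of thumb, not a measurement): 16 bytes per parameter,
# with activations and framework overhead excluded.

BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def full_finetune_memory_gb(num_params: float) -> float:
    """Approximate training-state memory in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


for name, n in [("110M", 110e6), ("1B", 1e9), ("7B", 7e9), ("65B", 65e9)]:
    print(f"{name:>5} params: ~{full_finetune_memory_gb(n):,.0f} GB")
```

Under these assumptions a 1-billion-parameter model needs roughly 16 GB for weights, gradients, and optimiser state, while a 7-billion-parameter model needs roughly 112 GB, which is one reason parameter-efficient methods become attractive at larger scales.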

Feature extraction (frozen backbone)

The pre-trained model's weights are frozen entirely, and only a new classification head (typically one or two linear layers) is trained on the target task. This is the most parameter-efficient approach and works well when:

The downstream task is similar to the pre-training task
Very little labelled data is available (reducing overfitting risk)
Compute is limited
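The idea can be sketched in a few lines of plain Python. Here a fixed function stands in for the frozen backbone (a deliberate simplification; a real backbone would be a pre-trained CNN or transformer with its weights frozen), and only a small logistic-regression head is trained on top of its features. All names and the toy dataset are hypothetical.

```python
import math


def frozen_backbone(x: float) -> list:
    """Stand-in for a frozen pre-trained feature extractor (never updated)."""
    return [x, x * x]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, lr=0.1, epochs=500):
    """Train only a logistic-regression head on top of the frozen features."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            f = frozen_backbone(x)  # features are computed, not learned
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            err = p - y  # gradient of the log-loss with respect to the logit
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def predict(w, b, x) -> int:
    f = frozen_backbone(x)
    return int(sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5)


# Tiny labelled set: class 1 when |x| > 1, which is separable using the x*x feature.
data = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (1.5, 1), (2.0, 1)]
w, b = train_head(data)
accuracy = sum(predict(w, b, x) == y for x, y in data) / len(data)
print(f"train accuracy = {accuracy:.2f}")
```

Only the three head parameters (two weights and a bias) are ever updated, which is why this approach is cheap and hard to overfit when labelled data is scarce.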

Gradual unfreezing

Layers are unfrozen progressively during training, starting from the classification head and working down to earlier layers. This limits catastrophic forgetting of pre-trained features while still allowing deeper adaptation. ULMFiT (Howard & Ruder, 2018) popularised this approach alongside discriminative fine-tuning, a complementary technique that uses different learning rates for different layers, with lower rates for earlier (more general) layers.
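Both ideas reduce to simple schedules, sketched below. The unfreezing schedule assumes one additional layer is unfrozen per epoch, and the per-layer learning rates divide by a factor of 2.6 for each step towards the input, the value suggested in the ULMFiT paper. The function names are illustrative and not taken from any library.

```python
def trainable_layers(num_layers: int, epoch: int) -> list:
    """Indices of unfrozen layers at a given epoch (0 = earliest layer).

    Assumed schedule: epoch 0 trains only the last layer (the head), and
    each later epoch unfreezes one more layer, moving towards the input.
    """
    num_unfrozen = min(epoch + 1, num_layers)
    return list(range(num_layers - num_unfrozen, num_layers))


def discriminative_lrs(num_layers: int, top_lr: float, decay: float = 2.6) -> list:
    """Per-layer learning rates: the last layer gets top_lr, and each earlier
    layer's rate is the following layer's rate divided by `decay`."""
    return [top_lr / decay ** (num_layers - 1 - i) for i in range(num_layers)]


num_layers = 4
for epoch in range(5):
    print(f"epoch {epoch}: training layers {trainable_layers(num_layers, epoch)}")

for i, lr in enumerate(discriminative_lrs(num_layers, top_lr=1e-3)):
    print(f"layer {i}: lr = {lr:.2e}")
```

In a real training loop the first function would decide which parameters have gradients enabled at each epoch, and the second would populate per-layer optimiser settings.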

Parameter-efficient fine-tuning (PEFT)

Methods that update only a small fraction of the model's parameters while keeping the rest frozen:

LoRA (Low-Rank Adaptation; Hu et al. 2021): Injects trainable low-rank matrices into each transformer layer's attention projections. Typically trains only 0.1–1% of total parameters while often matching or approaching full fine-tuning performance. LoRA has become the de facto standard for fine-tuning large language models.
QLoRA (Dettmers et al. 2023): Combines LoRA with 4-bit quantisation of the base model, enabling fine-tuning of 65B+ parameter models on a single 48GB GPU.
Adapters (Houlsby et al. 2019): Small bottleneck modules inserted between transformer layers. Each adapter has far fewer parameters than the layer it augments.
Prefix tuning (Li & Liang, 2021): Prepends learnable "virtual tokens" to the input of each transformer layer, steering the model without modifying its weights.
Prompt tuning (Lester et al. 2021): A simplified version of prefix tuning that only prepends learnable embeddings to the input layer.
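As an illustration of the most widely used of these methods, the sketch below shows the LoRA update for a single weight matrix in plain Python: the frozen weight W is augmented with a low-rank product B·A scaled by alpha/rank, and B starts at zero so that training begins from the unmodified base model. The dimensions, seed, and helper names are arbitrary choices for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """B is d_out x rank and A is rank x d_in: rank * (d_out + d_in) parameters."""
    return rank * (d_out + d_in)


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute W x + (alpha / rank) * B (A x) for a column vector x."""
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    scale = alpha / rank
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]


random.seed(0)
d_out, d_in, rank, alpha = 4, 6, 2, 4.0
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[0.0] * rank for _ in range(d_out)]                                   # trainable, zero init
x = [[1.0] for _ in range(d_in)]

# With B initialised to zero, the adapted layer matches the base layer exactly.
assert lora_forward(W, A, B, x, alpha, rank) == matmul(W, x)

# Parameter count for one 4096 x 4096 projection at rank 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

For a 4096 × 4096 projection at rank 8, the adapter has 65,536 trainable parameters against 16,777,216 in the full matrix, about 0.39%, which falls within the 0.1–1% range quoted above.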

Instruction tuning

Fine-tuning a language model on a diverse collection of tasks formatted as natural-language instructions (e.g. "Summarise the following article:", "Translate to French:", "Write a Python function that..."). This teaches the model to follow instructions generally, not just on specific tasks:

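FLAN (Wei et al. 2022): Fine-tuned a 137B-parameter LaMDA-PT model on more than 60 NLP datasets phrased as instructions, substantially improving zero-shot performance on held-out task types. The follow-up Flan-PaLM (Chung et al. 2022) scaled the approach to 1,836 tasks.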
InstructGPT (Ouyang et al. 2022): Combined supervised fine-tuning with RLHF, producing models that were preferred by humans over the much larger base GPT-3.
Self-instruct (Wang et al. 2023): Used a language model to generate its own instruction-following training data from a seed set of 175 human-written tasks, bootstrapping instruction tuning with minimal human annotation.
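In practice, each training example is rendered into a prompt and a target response. The sketch below shows one way this might look; the template is hypothetical (projects define their own formats), tokens are whitespace-split words purely for illustration, and masking the loss on prompt tokens is a common convention assumed here rather than something every method requires.

```python
# Hypothetical prompt template; real projects each define their own format.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"


def build_example(instruction: str, input_text: str, response: str):
    """Return (tokens, loss_mask) for one supervised example.

    loss_mask is 1 for response tokens and 0 for prompt tokens, so only the
    response contributes to the training loss. A real pipeline would use the
    model's tokenizer instead of str.split().
    """
    prompt_tokens = TEMPLATE.format(instruction=instruction, input=input_text).split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, mask = build_example(
    instruction="Translate to French:",
    input_text="Good morning",
    response="Bonjour",
)
for tok, m in zip(tokens, mask):
    print(m, tok)
```

Mixing many such examples across different task types is what distinguishes instruction tuning from fine-tuning on a single task.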

RLHF and preference tuning

After supervised fine-tuning, models are further refined using human preference data:

Reinforcement learning from human feedback (RLHF): Train a reward model on human comparisons of model outputs, then use PPO (Proximal Policy Optimisation) to fine-tune the language model to maximise the learned reward. Used by ChatGPT, Claude, and most commercial chat models.
DPO (Direct Preference Optimisation; Rafailov et al. 2023): Eliminates the separate reward model by directly optimising the language model on preference pairs, simplifying the RLHF pipeline.
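GRPO (Group Relative Policy Optimisation; Shao et al. 2024): Samples a group of responses for each prompt, scores them, and uses each response's reward relative to the rest of the group as its advantage for policy updates, removing the need for a separate value model. Used in DeepSeek-R1 and reasoning model training.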
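The objectives behind DPO and GRPO are compact enough to sketch numerically. The first function below computes the DPO loss for a single preference pair from summed log-probabilities under the policy and a frozen reference model; the second computes group-relative advantages by normalising rewards within a group. The log-probabilities and rewards are made-up numbers, and using the population standard deviation is an assumption of this sketch.

```python
import math
import statistics


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected responses
    under the policy being trained and under the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability towards the chosen response: the loss is lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
# Four sampled responses, two rewarded: positive and negative advantages.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In both cases the frozen reference model or the group baseline plays the role that a learned reward or value model plays in PPO-based RLHF.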

Key considerations

Learning rate

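The learning rate for fine-tuning is typically smaller than the peak rate used in pre-training, often by up to an order of magnitude or more, so that updates refine the pre-trained weights rather than overwrite them. Common ranges: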

Full fine-tuning of BERT-scale models: 1e-5 to 5e-5
Full fine-tuning of LLMs: 1e-5 to 2e-5
LoRA: 1e-4 to 3e-4 (can be higher since fewer parameters are updated)

Catastrophic forgetting

When fine-tuned aggressively, a model can "forget" capabilities learned during pre-training. Mitigations include low learning rates, short training duration, gradual unfreezing, and regularisation techniques like elastic weight consolidation (EWC).
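As a sketch of how EWC discourages forgetting, the penalty below adds a quadratic cost for moving each parameter away from its pre-trained value, weighted by an importance estimate (the diagonal of the Fisher information). The parameter values and importance weights are hypothetical numbers chosen to show the effect.

```python
def ewc_penalty(params, pretrained_params, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - p0) ** 2 for p, p0, f in zip(params, pretrained_params, fisher)
    )


def total_loss(task_loss, params, pretrained_params, fisher, lam=1.0):
    """Task loss plus the EWC regulariser."""
    return task_loss + ewc_penalty(params, pretrained_params, fisher, lam)


pretrained = [1.0, -2.0, 0.5]
fisher = [10.0, 0.1, 0.0]        # hypothetical importance estimates
candidate_a = [1.5, -2.0, 0.5]   # moves an important weight by 0.5
candidate_b = [1.0, -2.0, 1.0]   # moves an unimportant weight by 0.5

print(ewc_penalty(candidate_a, pretrained, fisher, lam=1.0))
print(ewc_penalty(candidate_b, pretrained, fisher, lam=1.0))
```

The same-sized change is penalised heavily when it touches a parameter the pre-trained model relies on and not at all when it touches one with zero estimated importance.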

Overfitting

Fine-tuning datasets are often small relative to the model's capacity. Standard mitigations: early stopping, dropout, weight decay, data augmentation, and reducing the number of trainable parameters (LoRA, adapters).
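Early stopping is the simplest of these to implement. The sketch below, with a hypothetical validation curve and illustrative parameter names, stops once the validation loss has failed to improve for a set number of consecutive epochs and reports which epoch's checkpoint to keep.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has failed to improve
    on the best value by more than min_delta for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation curve: improves, then starts to overfit.
losses = [0.90, 0.62, 0.51, 0.49, 0.50, 0.53, 0.58]
stop, best = early_stopping_epoch(losses, patience=2)
print(f"stop after epoch {stop}, restore checkpoint from epoch {best}")
```

With this curve, training halts after epoch 5 and the checkpoint from epoch 3, where validation loss was lowest, is the one to keep.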

Data quality

Fine-tuning amplifies the effect of data quality. A small, high-quality dataset often outperforms a large noisy one. For instruction tuning, the LIMA paper (Zhou et al. 2023) showed that fine-tuning LLaMA-65B on just 1,000 carefully curated examples, with no RLHF, produced a model whose responses human evaluators judged equivalent or preferable to GPT-4's in 43% of cases.

Impact

Fine-tuning transformed AI from a field where each task required its own architecture and dataset into one where a single pre-trained model can be rapidly adapted to thousands of tasks. This has:

Democratised AI deployment: Organisations without massive compute budgets can fine-tune open-weight models on their domain data, building on pre-training runs that cost millions of dollars or more rather than repeating them.
Created the open-source model ecosystem: The ability to fine-tune released base models (LLaMA, Mistral, Qwen) spawned thousands of community-created specialised models on platforms like Hugging Face.
Enabled AI alignment: Instruction tuning and RLHF, both forms of fine-tuning, are among the primary mechanisms for making raw language models safe and useful as assistants.
Reduced data requirements: Tasks that once needed millions of labelled examples can now be solved with hundreds, by building on pre-trained representations.

References

Donahue, J. et al. (2014). "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition". ICML 2014.
Howard, J. & Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification". ACL 2018.
Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL 2019.
Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models". ICLR 2022. arXiv:2106.09685.
Wei, J. et al. (2022). "Finetuned Language Models Are Zero-Shot Learners". ICLR 2022.
Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback". NeurIPS 2022.
Dettmers, T. et al. (2023). "QLoRA: Efficient Finetuning of Quantized Language Models". NeurIPS 2023. arXiv:2305.14314.
Zhou, C. et al. (2023). "LIMA: Less Is More for Alignment". NeurIPS 2023.
Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". NeurIPS 2023.