LLaMA

LLaMA (Large Language Model Meta AI) is a family of open-weight large language models developed by Meta AI, first released in February 2023. The LLaMA series is among the most widely adopted foundations for open-source and open-weight AI development, with thousands of derivative models fine-tuned for instruction-following, coding, reasoning, and domain-specific applications. By releasing high-quality model weights, at first under a research-only licence and later under licences permitting most commercial use, Meta altered the competitive dynamics of the AI industry and helped establish open-weight models as credible alternatives to proprietary systems from OpenAI, Anthropic, and Google DeepMind.

LLaMA 1 (February 2023)

The original LLaMA family was released on 24 February 2023 in four sizes: 7B, 13B, 33B, and 65B parameters.[1] All four models were decoder-only transformers trained on publicly available data — a deliberate choice to demonstrate that frontier-quality models could be built without proprietary datasets.

Architecture

LLaMA 1 incorporated several architectural refinements over the original GPT design:

  • Pre-normalisation with RMSNorm: normalisation is applied to the input of each sub-block rather than to its output, as in GPT-3, which improves training stability. Root Mean Square Layer Normalisation (RMSNorm) replaces standard layer normalisation because it is cheaper to compute.
  • SwiGLU activation: the feed-forward network used the SwiGLU activation function (Shazeer, 2020) instead of ReLU to improve downstream performance.
  • Rotary positional embeddings (RoPE): absolute positional embeddings were replaced with rotary embeddings (Su et al., 2021), applied at every layer, so that attention scores depend on the relative positions of tokens.
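  • Efficient causal multi-head attention: LLaMA 1 kept standard multi-head attention at every size but used a memory-efficient implementation that avoids storing attention weights and skips masked key-query scores. Grouped-query attention, which shares key-value heads across several query heads, was not introduced until LLaMA 2.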
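
The normalisation, activation, and positional-encoding choices listed above can be illustrated with a minimal pure-Python sketch. It is not Meta's implementation: the function names, the `eps` and `base` defaults, and the pairing of adjacent vector elements in `rope` are assumptions made for illustration, and real implementations operate on batched tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-element gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]), i even, by the angle position * base**(-i/d).

    Assumes len(x) is even.
    """
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Because `rope` applies a pure rotation, the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n, which is the property that makes the encoding relative.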

Training data

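The two largest LLaMA 1 models (33B and 65B) were trained on approximately 1.4 trillion tokens and the 7B and 13B models on approximately 1 trillion tokens, all drawn entirely from publicly available sources:[1]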

Source          Proportion   Description
CommonCrawl     67%          Web text filtered with a classifier trained on Wikipedia references
C4              15%          Google's Colossal Clean Crawled Corpus
GitHub          4.5%         Public code repositories
Wikipedia       4.5%         20 languages
Books           4.5%         Project Gutenberg and Books3
ArXiv           2.5%         Scientific papers (LaTeX source)
StackExchange   2%           Question-answer pairs

Performance

According to the LLaMA paper, LLaMA 13B outperformed GPT-3 (175B) on most benchmarks despite being more than ten times smaller, and LLaMA 65B was competitive with Chinchilla-70B and PaLM-540B.[1] These results are consistent with the Chinchilla scaling laws, which predict that for a fixed training compute budget a smaller model trained on more data can outperform a larger model trained on less.
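
The data-to-parameter ratios behind this comparison can be worked out directly. The sketch below is illustrative arithmetic only. The token counts for GPT-3 (about 300 billion) and LLaMA 13B (about 1 trillion) are taken from the respective papers and are assumptions as far as this article is concerned, and the "6 FLOPs per parameter per token" figure is a widely used rule of thumb rather than a measured cost.

```python
# (model, parameter count, training tokens); the GPT-3 and LLaMA 13B token
# counts come from their papers and are assumptions relative to this article.
RUNS = [
    ("GPT-3 175B", 175e9, 300e9),
    ("LLaMA 13B", 13e9, 1.0e12),
    ("LLaMA 65B", 65e9, 1.4e12),
]

def tokens_per_parameter(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens

if __name__ == "__main__":
    for name, params, tokens in RUNS:
        ratio = tokens_per_parameter(params, tokens)
        flops = approx_training_flops(params, tokens)
        print(f"{name}: {ratio:.1f} tokens/param, ~{flops:.2e} training FLOPs")
```

Under these figures GPT-3 saw fewer than two tokens per parameter, whereas the LLaMA models saw more than twenty, which is the regime the Chinchilla analysis favours.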

Release and leak

LLaMA 1 weights were initially released under a non-commercial research licence, restricted to approved academic researchers. Within a week of release, the weights were leaked via a torrent on 4chan, making them effectively public. This unintended release catalysed an explosion of open-source development, as researchers and hobbyists worldwide began fine-tuning and adapting the models.

Derivative models

The leak produced a rapid ecosystem of derivatives:

  • Alpaca (Stanford, March 2023): LLaMA 7B fine-tuned on 52K instruction-following examples generated by GPT-3.5, demonstrating that a small amount of instruction tuning could make a base model conversational.
  • Vicuna (LMSYS, March 2023): LLaMA 13B fine-tuned on ShareGPT conversations, achieving an estimated 90% of ChatGPT's quality in a preliminary evaluation that used GPT-4 as the judge.
  • WizardLM (Microsoft, April 2023): used "Evol-Instruct" to generate progressively more complex training examples.
  • Code Llama (Meta, August 2023): official code-specialised variants fine-tuned on code data. Unlike the models above, these were built on LLaMA 2 rather than on the leaked LLaMA 1 weights.

LLaMA 2 (July 2023)

Released on 18 July 2023, LLaMA 2 represented a major step toward genuine open access.[2]

Key changes

  • Sizes: 7B, 13B, and 70B parameters (a 34B model was trained and described in the paper but not released).
  • Training data: 2 trillion tokens — a 40% increase over LLaMA 1 — from an updated mix of publicly available data.
  • Context window: doubled from 2,048 to 4,096 tokens.
  • Grouped-query attention: adopted for the 70B model, reducing KV-cache memory during inference; the 7B and 13B models kept standard multi-head attention.
  • Licence: the Llama 2 Community License permitted commercial use for organisations with fewer than 700 million monthly active users, a dramatic liberalisation from the research-only LLaMA 1 licence.
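
The effect of grouped-query attention on KV-cache memory can be estimated with simple arithmetic. The sketch below assumes the publicly documented shape of the 70B model (80 layers, 64 query heads, a head dimension of 128, 8 key-value heads) and 16-bit cache entries. None of these figures appear in the text above, so they should be read as assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both a key and a value vector.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed 70B shape: 80 layers, head_dim 128, full 4,096-token context.
mha_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

if __name__ == "__main__":
    print(f"multi-head (64 KV heads):   {mha_bytes / 2**30:.2f} GiB per sequence")
    print(f"grouped-query (8 KV heads): {gqa_bytes / 2**30:.2f} GiB per sequence")
```

With these assumptions the cache shrinks by the ratio of query heads to key-value heads, a factor of eight, which matters most when serving many long sequences concurrently.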

LLaMA 2-Chat

Meta simultaneously released LLaMA 2-Chat models, fine-tuned for dialogue using a combination of supervised fine-tuning (SFT) on human-written demonstrations and reinforcement learning from human feedback (RLHF) with a reward model trained on over one million human preference annotations. The RLHF process used rejection sampling followed by proximal policy optimisation (PPO), with iterative rounds of data collection and training.
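
Two ingredients of this pipeline, the pairwise loss used to train a reward model and rejection sampling against that reward model, can be sketched in a few lines. This is a toy illustration rather than Meta's training code: `generate`, `reward`, and `toy_reward` are hypothetical stand-ins for a language model and a trained reward network, and the PPO stage is not shown.

```python
import math

def ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected - margin)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one.
    """
    return math.log1p(math.exp(-(reward_chosen - reward_rejected - margin)))

def rejection_sample(prompt, generate, reward, k=4):
    """Draw k candidate responses and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))

def toy_reward(prompt, response):
    """Hypothetical reward that simply prefers longer responses."""
    return float(len(response))

if __name__ == "__main__":
    canned = iter(["No.", "Yes, here is how.", "Maybe."])
    best = rejection_sample("How do I start?", lambda p: next(canned), toy_reward, k=3)
    print(best)
```

In a rejection-sampling round, the selected responses would then serve as fine-tuning targets for the next iteration of the model.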

The 70B Chat model was competitive with ChatGPT (GPT-3.5) on many human evaluation benchmarks, establishing that open-weight models could approach proprietary chat models in quality.

LLaMA 3 (April 2024)

LLaMA 3, released on 18 April 2024, marked another substantial leap in both scale and capability.[3]

Architecture and training

  • Sizes: 8B and 70B at launch; 405B released in July 2024 as LLaMA 3.1.
  • Tokeniser: switched from SentencePiece (32K vocabulary) to tiktoken-based with a 128K vocabulary, improving encoding efficiency for non-English languages and code.
  • Training data: over 15 trillion tokens — a 7.5× increase over LLaMA 2 — with significantly more multilingual and code data.
  • Context window: 8,192 tokens (extended to 128K in LLaMA 3.1 via continued pre-training with progressive context extension).
  • Grouped-query attention: used across all sizes with 8 KV heads.
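
Grouped-query attention with a fixed number of key-value heads can be sketched as follows for a single new token attending over cached context. The sketch is illustrative only: the function names are hypothetical, the contiguous assignment of query heads to key-value heads is an assumption, and the 32-query-head figure in the usage note is taken from the public 8B configuration rather than from the text above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kv_head_for(query_head, n_heads, n_kv_heads):
    """Map a query head to the key-value head it shares (contiguous groups)."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size

def gqa_attention(queries, keys, values):
    """Attention for one new token.

    queries:      [n_heads][head_dim]
    keys, values: [n_kv_heads][seq_len][head_dim]  (cached context)
    returns:      [n_heads][head_dim]
    """
    n_heads, n_kv_heads = len(queries), len(keys)
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
    outputs = []
    for h, q in enumerate(queries):
        kv = kv_head_for(h, n_heads, n_kv_heads)
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[kv]]
        weights = softmax(scores)
        dim = len(values[kv][0])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values[kv])) for j in range(dim)]
        )
    return outputs
```

With 32 query heads and 8 key-value heads, `kv_head_for` assigns four query heads to each cached key-value head, so the cache stores a quarter of what full multi-head attention would require.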

LLaMA 3.1 (July 2024)

The LLaMA 3.1 release added the 405B model — the largest open-weight model available at time of release — alongside updated 8B and 70B variants with 128K context support. LLaMA 3.1 405B was competitive with GPT-4 and Claude 3.5 Sonnet on many benchmarks, representing a milestone for open-weight models.[4]

LLaMA 3.2 (September 2024)

LLaMA 3.2 introduced multimodal capabilities, with 11B and 90B vision-language models capable of processing images alongside text, as well as lightweight 1B and 3B text-only models designed for edge deployment and on-device inference.

LLaMA 3.3 (December 2024)

LLaMA 3.3 70B, released in December 2024, achieved performance comparable to LLaMA 3.1 405B on many text-based benchmarks through improved post-training, demonstrating substantial gains from alignment techniques without increasing model size.

LLaMA 4 (April 2025)

LLaMA 4, released in April 2025, represented Meta's first adoption of the mixture of experts (MoE) architecture for the LLaMA family.[5]

Models

  • Llama 4 Scout (17B active / 109B total): 16 experts per layer, top-1 routing, and a context window of up to 10 million tokens.
  • Llama 4 Maverick (17B active / 400B total): 128 experts per layer with shared experts, optimised for quality on reasoning and coding tasks.
  • Llama 4 Behemoth (announced, not yet released): a larger model that Meta described as having 288B active and nearly 2 trillion total parameters, still in training at the time of the announcement.

The MoE architecture allowed LLaMA 4 models to achieve high quality while keeping active inference compute comparable to much smaller dense models.
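
A minimal sketch of top-1 routing with a shared expert shows why only a fraction of the parameters is active for each token. It is illustrative rather than a description of Meta's implementation: the sigmoid gate on the chosen router score, the function names, and the representation of experts as plain Python callables are all assumptions.

```python
import math

def top1_moe(x, router_weights, experts, shared_expert):
    """Send token x to its single highest-scoring routed expert plus a shared expert.

    router_weights: one row of weights per routed expert
    experts:        list of callables, one per routed expert
    shared_expert:  callable applied to every token
    Returns the combined output and the index of the chosen expert.
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    chosen = max(range(len(experts)), key=lambda i: logits[i])
    gate = 1.0 / (1.0 + math.exp(-logits[chosen]))  # assumed sigmoid gating
    routed = experts[chosen](x)  # only this routed expert runs for the token
    shared = shared_expert(x)
    return [s + gate * r for s, r in zip(shared, routed)], chosen

def active_fraction(active_params, total_params):
    """Share of the model's parameters used to process a single token."""
    return active_params / total_params

if __name__ == "__main__":
    print(f"Scout:    {active_fraction(17e9, 109e9):.1%} of parameters active per token")
    print(f"Maverick: {active_fraction(17e9, 400e9):.1%} of parameters active per token")
```

All experts must still be held in memory, so the saving is in compute per token rather than in the storage footprint of the weights.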

Ecosystem and impact

Open-weight movement

LLaMA's release is widely credited with catalysing the modern open-weight AI movement. Before LLaMA, open language models (GPT-J, GPT-NeoX, BLOOM) existed but trailed proprietary models by a significant quality margin. LLaMA demonstrated that with sufficient training data and modern architectural choices, open models could approach proprietary frontier systems.

LLaMA was followed by open-weight releases from other major labs, a trend often attributed in part to the competitive pressure it created:

  • Mistral AI: Mistral 7B (September 2023), Mixtral 8×7B (December 2023)
  • Google: Gemma 2B/7B (February 2024), Gemma 2 (June 2024)
  • Alibaba: Qwen series (2023–2025)
  • DeepSeek: DeepSeek-V2 (May 2024), DeepSeek-V3 (December 2024)

Fine-tuning and adaptation

LLaMA models have become a common starting point for fine-tuning in the open-source community. Parameter-efficient methods such as LoRA and QLoRA, together with libraries such as Hugging Face Transformers, allow researchers and developers to adapt LLaMA models to specialised applications on modest compute budgets, while inference engines such as vLLM and llama.cpp serve the resulting models.
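
The reason low-rank adaptation fits modest compute budgets can be seen from a parameter count and a toy forward pass. The sketch below is a plain-Python illustration of the LoRA idea, not the API of any particular library. The 4096-wide projection and rank 8 in the usage note are illustrative assumptions.

```python
def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def lora_param_counts(d_in, d_out, rank):
    """Return (parameters in the full matrix, trainable parameters in the LoRA pair)."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

def lora_forward(x, w, a, b, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    w: frozen pretrained weights, shape [d_out][d_in]
    a: trainable down-projection, shape [rank][d_in]
    b: trainable up-projection, shape [d_out][rank], usually initialised to zero
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]

if __name__ == "__main__":
    full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
    print(f"full matrix: {full:,} parameters; LoRA pair: {lora:,} ({lora / full:.2%})")
```

Because the up-projection starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only the two small matrices receive gradient updates.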

Quantisation and local inference

The LLaMA architecture's clean design made it a primary target for quantisation research. Post-training quantisation methods such as GPTQ and AWQ, and inference libraries such as llama.cpp (Georgi Gerganov, March 2023) and ExLlamaV2, make it possible to run LLaMA models on consumer hardware. LLaMA 2 7B was among the first models to run usably on a smartphone, and LLaMA 3.2 1B/3B were explicitly designed for on-device deployment.
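
The basic idea behind weight quantisation, and its effect on memory, can be sketched with symmetric round-to-nearest quantisation of a block of weights. This is a simplified illustration under stated assumptions. The methods and file formats named above use more elaborate schemes, and the memory estimate ignores the small overhead of storing per-block scales.

```python
def quantize_block(values, bits=4):
    """Symmetric round-to-nearest quantisation with one float scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes, 127 for 8-bit codes
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate float weights from integer codes."""
    return [scale * c for c in codes]

def weight_memory_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    return n_params * bits_per_weight / 8 / 2**30

if __name__ == "__main__":
    scale, codes = quantize_block([0.10, -0.70, 0.33, 0.90], bits=4)
    print(codes, dequantize_block(scale, codes))
    for bits in (16, 8, 4):
        print(f"7B parameters at {bits}-bit: {weight_memory_gib(7e9, bits):.2f} GiB")
```

Each reconstructed weight differs from the original by at most half a quantisation step, which is the error that more sophisticated methods try to reduce or to place where it harms accuracy least.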

Licensing debate

Meta's licences have been criticised as not meeting the Open Source Initiative's definition of "open source" because they impose restrictions on large-scale commercial use (the 700M MAU threshold) and prohibit using model outputs to train competing models. Defenders argue that the licences are far more permissive than those of proprietary models and have enabled unprecedented access to frontier-quality AI.

References

  1. Touvron, Hugo, et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
  2. Touvron, Hugo, et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.
  3. Meta AI (2024). "Introducing Meta Llama 3: The most capable openly available LLM to date." Meta AI Blog, 18 April 2024.
  4. Meta AI (2024). "Introducing Llama 3.1: Our most capable models to date." Meta AI Blog, 23 July 2024.
  5. Meta AI (2025). "Introducing Llama 4." Meta AI Blog, April 2025.