Programming Ocean Academy

Mohammed Fahd Al-Abrah


AI/ML/DL Engineer

Introduction

Over the past several years, Transformer-based language models have evolved from task-specific NLP systems into general-purpose foundation models capable of reasoning, coding, multilingual understanding, and instruction following. This evolution did not arise from a single breakthrough, but from a systematic progression in architecture stability, mathematical formulation, training paradigms, and scaling strategy.

Early models such as GPT and BERT established the core Transformer variants (decoder-only for generation and encoder-only for representation learning), demonstrating that large-scale self-supervised pretraining could replace handcrafted features. Subsequent models, including GPT-2, GPT-3, and T5, showed that scaling model capacity and data unlocks emergent capabilities such as in-context learning and task generalization, often reducing or eliminating the need for task-specific fine-tuning.

As models grew larger, research shifted toward efficiency and predictability. PaLM demonstrated that systems-level and architectural refinements make training at extreme scale feasible, while LLaMA showed that data-optimal training can rival or outperform brute-force parameter scaling; InstructGPT and GPT-4 established alignment and human feedback as essential post-training stages. More recent models, including Mistral, Mixtral, Gemma, and SmolLM2, mark a new phase focused on inference efficiency, conditional computation, data-centric optimization, and responsible open release, enabling strong performance even at small or moderate scales.

Together, these models illustrate a converging picture: the Transformer architecture and its core mathematics have largely stabilized, while progress now depends on how efficiently models are trained, aligned, and deployed. This trajectory defines the modern foundation-model paradigm, where capability, efficiency, and responsibility are co-optimized rather than treated as separate goals.

Transformer Foundations

The Transformer is the core architecture behind all modern large language models discussed in this atlas. It was introduced to replace recurrent and convolutional models, relying on attention instead of recurrence for sequence modeling.

This design enables full parallel computation, stable scaling to very deep networks, and flexible adaptation across diverse tasks. Language is represented as token embeddings combined with positional information, allowing models to reason over both content and order.

Each Transformer layer contains two essential components: self-attention, which allows each token to selectively attend to other tokens in the sequence, and position-wise feed-forward networks, which apply nonlinear transformations independently to each token. Residual connections and normalization ensure stable optimization at scale.
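
The sketch below assembles these pieces into a single block: scaled dot-product self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and normalization, with an optional causal mask for decoder-style layers. It is a single-head, NumPy-only toy with invented dimensions, not an implementation of any specific model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance (learned scale/shift omitted).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block disallowed positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2, causal=False):
    # 1) self-attention sub-layer with residual connection and normalization
    T = x.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool)) if causal else None
    attn_out = attention(x @ Wq, x @ Wk, x @ Wv, mask) @ Wo
    x = layer_norm(x + attn_out)
    # 2) position-wise feed-forward sub-layer, applied independently to each token
    ffn_out = np.maximum(0, x @ W1) @ W2        # ReLU feed-forward network
    return layer_norm(x + ffn_out)

# Toy usage: 5 tokens, model width 16, FFN width 64 (all dimensions illustrative).
rng = np.random.default_rng(0)
d, d_ff, T = 16, 64, 5
params = [rng.normal(0, 0.1, s) for s in [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
tokens = rng.normal(size=(T, d))
out = transformer_block(tokens, *params, causal=True)   # causal=True ~ decoder-only layer
print(out.shape)  # (5, 16)
```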

Three canonical Transformer variants dominate modern language modeling: encoder-only models (such as BERT and ELECTRA) that use bidirectional attention for representation learning; decoder-only models (such as GPT, LLaMA, PaLM, and Mistral) that use causal attention for text generation and reasoning; and encoder–decoder models (such as T5) that combine both paradigms for sequence-to-sequence tasks.

The mathematical core of the Transformer has remained largely unchanged since its introduction. Progress has instead come from efficiency-driven refinements, including RMSNorm replacing LayerNorm, gated feed-forward networks such as SwiGLU and GeGLU, efficient attention variants (multi-query, grouped-query, and sliding-window attention), and sparse conditional computation via mixture-of-experts.
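
Two of these refinements can be sketched directly: RMSNorm rescales activations by their root-mean-square without subtracting the mean, and a SwiGLU-style feed-forward layer gates one linear projection with the SiLU of another before projecting back down. Shapes and weights in the sketch below are illustrative assumptions.

```python
import numpy as np

def rmsnorm(x, g, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square only (no mean subtraction, unlike LayerNorm).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * g

def silu(x):
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated feed-forward: SiLU(x W_gate) elementwise-gates (x W_up), then projects back down.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(1)
d, d_ff = 16, 48
x = rng.normal(size=(4, d))
g = np.ones(d)                                    # learned gain, initialized to ones here
W_gate, W_up, W_down = (rng.normal(0, 0.1, s) for s in [(d, d_ff), (d, d_ff), (d_ff, d)])
y = swiglu_ffn(rmsnorm(x, g), W_gate, W_up, W_down)
print(y.shape)  # (4, 16)
```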

These refinements preserve the original Transformer structure while reducing memory usage, improving inference throughput, and increasing effective model capacity. Rather than being replaced, the Transformer has absorbed new ideas incrementally, becoming the universal backbone of foundation models.

The Transformer’s enduring success stems from its architectural simplicity, mathematical stability, and reliable scalability. As a result, modern progress focuses less on inventing new architectures and more on data quality and scaling, training paradigms, alignment and safety, and efficient deployment.

Improving Language Understanding by Generative Pre-Training

Radford et al. — 2018 — OpenAI Technical Report

Abstract

This paper introduces a semi-supervised framework for natural language understanding that combines unsupervised generative pre-training with supervised discriminative fine-tuning. A Transformer-based language model is first trained on large unlabeled text corpora and then adapted to diverse NLP tasks using minimal architectural modifications. The approach significantly improves performance across multiple benchmarks, establishing generative pre-training as an effective foundation for universal language understanding models.


Problems

  • Labeled datasets for many NLP tasks are scarce, expensive, and domain-specific.
  • Discriminatively trained models rely heavily on task-specific architectures, limiting transferability.
  • Unsupervised representation learning methods struggle with higher-level semantics and long-range dependencies.
  • No consensus exists on how to transfer unsupervised linguistic knowledge to downstream tasks effectively.

Proposed Solutions

  • Use unsupervised language modeling as a pre-training objective to learn general linguistic representations.
  • Adopt a Transformer decoder architecture to model long-range dependencies efficiently.
  • Introduce task-specific input transformations that linearize structured inputs.
  • Fine-tune the pre-trained model with minimal additional parameters, optionally retaining an auxiliary LM loss.

Purpose

The primary goal is to demonstrate that a single task-agnostic model, generatively pre-trained on unlabeled text, can be adapted to a wide range of natural language understanding tasks and outperform specialized discriminative models.


Methodology

The framework consists of two stages. First, a multi-layer Transformer language model is trained on the BooksCorpus dataset using an autoregressive language modeling objective. Second, the model is fine-tuned on supervised tasks by adding a linear output layer and applying task-specific input transformations. An auxiliary language modeling loss is optionally retained during fine-tuning to improve generalization and convergence.
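
In the paper's notation, reconstructed here from the description above, the pre-training objective L1 maximizes the likelihood of each token given its preceding context, and fine-tuning optionally optimizes the supervised task loss L2 together with L1 as an auxiliary term weighted by a coefficient lambda:

```latex
L_1(\mathcal{U}) = \sum_i \log P\!\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right),
\qquad
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```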

Evaluation is performed on natural language inference, question answering, semantic similarity, and text classification benchmarks.


Results

  • State-of-the-art performance on 9 out of 12 evaluated benchmarks.
  • +8.9% improvement on the Story Cloze Test.
  • +5.7% improvement on RACE question answering.
  • +1.5% improvement on MultiNLI textual entailment.
  • Analysis of zero-shot behaviors shows task-relevant capabilities emerging and strengthening over the course of pre-training.
  • Ablation studies confirm the necessity of both Transformer architecture and unsupervised pre-training.

Conclusions

Generative pre-training combined with discriminative fine-tuning provides a scalable and effective solution for natural language understanding. Transformers trained on large unlabeled corpora acquire rich semantic and structural knowledge that transfers across tasks with minimal adaptation. This work establishes the foundation of modern pretrained language models and shows that unsupervised learning can significantly reduce reliance on labeled data while improving performance.

Featured Paper

OpenAI Research

Improving Language Understanding by Generative Pre-Training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
OpenAI Technical Report · 2018
Introduced the two-stage paradigm of unsupervised generative pre-training followed by supervised fine-tuning using a Transformer decoder, establishing the foundation of the GPT family.

“This work established generative pre-training as a universal foundation for NLP, proving that a single Transformer language model could transfer across tasks with minimal architectural changes — a turning point that launched modern large language models.”

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin et al. — 2018 — NAACL 2019

Abstract

This paper introduces BERT, a language representation model based on deep bidirectional Transformers, designed to pre-train contextual representations from unlabeled text and fine-tune them for a wide range of downstream NLP tasks. Unlike prior unidirectional or shallowly bidirectional approaches, BERT jointly conditions on both left and right context at every layer. The model achieves state-of-the-art results on multiple benchmarks, demonstrating that deep bidirectional pre-training significantly improves language understanding.


Problems

  • Most pre-trained language models are unidirectional, limiting full contextual understanding.
  • Shallow bidirectional methods combine separate left-to-right and right-to-left models, which is suboptimal.
  • Many NLP systems rely on heavily engineered, task-specific architectures with poor generalization.
  • Existing pre-training objectives fail to model sentence-pair relationships required for inference and QA.

Proposed Solutions

  • Introduce deep bidirectional pre-training using a Transformer encoder architecture.
  • Replace standard language modeling with a masked language modeling (MLM) objective.
  • Add a next sentence prediction (NSP) objective to explicitly model sentence-level relationships.
  • Adopt a unified fine-tuning framework with minimal task-specific modifications.

Purpose

The purpose is to show that a single deeply bidirectional pre-trained model can serve as a universal language understanding backbone, reducing reliance on task-specific architectures while improving performance across diverse NLP tasks.


Methodology

BERT is pre-trained on large unlabeled corpora, including BooksCorpus and English Wikipedia, using two objectives. Masked language modeling randomly masks 15% of input tokens and trains the model to predict them, enabling bidirectional context conditioning. Next sentence prediction trains the model to distinguish coherent sentence pairs from random ones.
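
A minimal sketch of the masking procedure described above. The token list and tiny vocabulary are invented for illustration, and the masking rate is raised in the demo call so the toy example visibly corrupts something; the 80/10/10 split follows the paper's description.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking sketch: pick ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok                      # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # random replacement
            # else: leave the token unchanged
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
# mask_rate raised from 0.15 only so this six-token toy example masks a few positions
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab, mask_rate=0.5))
```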

The architecture is a multi-layer bidirectional Transformer encoder. During fine-tuning, the same pre-trained model is adapted to downstream tasks by adding a simple output layer and fine-tuning all parameters end-to-end.


Results

  • State-of-the-art performance on 11 NLP benchmarks.
  • GLUE benchmark score improved to 80.5%.
  • MultiNLI accuracy increased to 86.7%.
  • SQuAD v1.1 F1 score reached 93.2.
  • SQuAD v2.0 F1 score reached 83.1.
  • Larger model variants consistently outperform smaller ones.
  • Ablation studies confirm the importance of bidirectional pre-training and NSP.

Conclusions

Deep bidirectional pre-training fundamentally enhances language representations and transfer learning in NLP. By combining masked language modeling and next sentence prediction within a Transformer encoder, BERT establishes a simple yet powerful framework that generalizes across tasks and data regimes. This work reshapes the standard approach to language understanding and lays the foundation for a new generation of pretrained language models.

Featured Paper

Google AI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
NAACL 2019 · 2018
Introduced deep bidirectional Transformer pre-training using masked language modeling and next sentence prediction, redefining contextual representation learning in NLP.

“BERT fundamentally changed language modeling by proving that deep bidirectional pre-training yields richer contextual representations, setting a new standard for transfer learning and becoming the backbone of modern NLP systems.”

Language Models are Unsupervised Multitask Learners

Radford et al. — 2019 — OpenAI Technical Report

Abstract

This paper demonstrates that sufficiently large language models trained on diverse unlabeled text can perform a wide range of natural language tasks in a zero-shot setting, without task-specific supervision or architectural modification. By scaling a Transformer-based language model and training it on a high-quality web corpus (WebText), the authors show that language modeling alone induces multitask learning behavior, achieving competitive or state-of-the-art results on several benchmarks.


Problems

  • NLP systems depend heavily on supervised, task-specific datasets that are costly and limited.
  • Models trained for narrow tasks generalize poorly beyond their training distribution.
  • Multitask and meta-learning approaches require explicitly curated objectives that do not scale well.
  • It is unclear whether a single unsupervised objective can support general-purpose language understanding.

Proposed Solutions

  • Train a high-capacity Transformer language model using only an unsupervised language modeling objective.
  • Use a large, diverse, and higher-quality web dataset (WebText) to expose the model to natural task patterns.
  • Leverage natural language prompting to specify tasks at inference time.
  • Scale model size to enable emergent zero-shot task performance.

Purpose

The purpose is to evaluate whether language modeling alone, when combined with sufficient data diversity and model capacity, can produce a generalist system capable of performing many NLP tasks without supervised fine-tuning.


Methodology

Four Transformer-based autoregressive language models, ranging from 117M to 1.5B parameters, are trained on the WebText dataset using a standard language modeling objective. The models use byte-level BPE tokenization and a GPT-style architecture with increased context length.

Evaluation is conducted entirely in zero-shot settings across language modeling, reading comprehension, question answering, summarization, translation, and commonsense reasoning. Tasks are specified through natural language prompts rather than task-specific heads.
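
The following sketch illustrates prompt-based task specification as described above. The templates and the generate() call are hypothetical placeholders rather than the paper's exact prompts, except for the "TL;DR:" summarization cue, which the paper reports using.

```python
def zero_shot_prompt(task, text):
    # The task is specified purely in natural language; no task-specific head or fine-tuning.
    templates = {
        "summarize": f"{text}\nTL;DR:",                    # summarization cue reported in the paper
        "translate_en_fr": f"english: {text}\nfrench:",     # translation-style cue (illustrative)
        "qa": f"{text}\nQ: Who is mentioned?\nA:",          # reading-comprehension-style cue (illustrative)
    }
    return templates[task]

prompt = zero_shot_prompt("summarize", "GPT-2 is trained on WebText with a language modeling objective ...")
print(prompt)
# completion = generate(model, prompt)   # hypothetical decoder call; the continuation is the task output
```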


Results

  • GPT-2 (1.5B parameters) achieves state-of-the-art results on 7 of 8 language modeling benchmarks.
  • Competitive zero-shot performance on reading comprehension (CoQA).
  • Improved results on the Winograd Schema Challenge.
  • Demonstrates non-trivial abilities in summarization, translation, and factual QA without supervision.
  • Performance scales approximately log-linearly with model size.

Conclusions

Unsupervised language modeling can implicitly induce multitask learning when paired with sufficient model capacity and diverse data. GPT-2 shows that many NLP tasks can be solved through prompting alone, challenging the assumption that explicit supervision is required for task learning. This work establishes scale and data diversity as key drivers of emergent capabilities and lays the groundwork for instruction-following and few-shot paradigms in later models.

Featured Paper

OpenAI Research

Language Models are Unsupervised Multitask Learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
OpenAI Technical Report · 2019
Demonstrated that large autoregressive Transformer language models trained purely with unsupervised language modeling can perform a wide range of NLP tasks in a zero-shot setting via natural language prompting.

“This work revealed that multitask learning can emerge naturally from scale and data diversity, establishing prompting as a mechanism for task specification and laying the groundwork for zero-shot, few-shot, and instruction-following language models.”

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)

Raffel et al. — 2019 — JMLR 2020

Abstract

This paper introduces a unified framework for transfer learning in NLP that reformulates all text-based tasks into a single text-to-text format. By systematically studying architectures, pre-training objectives, datasets, transfer strategies, and scale, the authors show that large Transformer models trained on a clean, massive corpus can achieve state-of-the-art performance across a wide range of NLP benchmarks. The work culminates in the Text-to-Text Transfer Transformer (T5), establishing unification and scale as central principles for modern NLP.


Problems

  • Transfer learning approaches in NLP are fragmented across task-specific formulations and objectives.
  • Comparing methods is difficult due to inconsistent task representations and evaluation pipelines.
  • The relative importance of architecture, objective, data quality, and scale is unclear.
  • Existing benchmarks insufficiently capture general-purpose language understanding.

Proposed Solutions

  • Reformulate all NLP tasks into a unified text-to-text input–output format.
  • Use a single Transformer architecture and maximum-likelihood objective for all tasks.
  • Systematically evaluate design choices across objectives, datasets, architectures, and scaling regimes.
  • Introduce a large, clean pre-training dataset, the Colossal Clean Crawled Corpus (C4).

Purpose

The purpose is to rigorously explore the design space of transfer learning in NLP and identify scalable, principled practices that yield strong general-purpose language models across a wide range of tasks.


Methodology

Encoder–decoder Transformer models are trained under a unified text-to-text framework, where task identity is specified through natural-language prefixes. Pre-training is performed on the C4 dataset using a denoising objective based on span corruption.
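
A minimal sketch of the span-corruption objective described above: selected spans are replaced by sentinel tokens in the input, and the target reconstructs them in order. Span positions here are fixed by hand for illustration; the real objective samples them (roughly 15% of tokens, mean span length 3), and in the text-to-text setting a task prefix such as "translate English to German:" would be prepended to the input.

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption sketch: each chosen span (start, end) is replaced by a
    sentinel in the input, and the target maps each sentinel back to its original span."""
    inp, tgt, sid = [], [], 0
    i = 0
    while i < len(tokens):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            sentinel = f"<extra_id_{sid}>"
            inp.append(sentinel)                  # corrupted input keeps only the sentinel
            tgt.append(sentinel)
            tgt.extend(tokens[span[0]:span[1]])   # target restores the dropped tokens
            sid += 1
            i = span[1]
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append(f"<extra_id_{sid}>")               # closing sentinel, per the T5 convention
    return inp, tgt

tokens = "thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (7, 8)])
print("input :", " ".join(inp))    # thank you <extra_id_0> me to your <extra_id_1> last week
print("target:", " ".join(tgt))    # <extra_id_0> for inviting <extra_id_1> party <extra_id_2>
```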

Extensive experiments compare architectures, objectives, datasets, multi-task versus single-task fine-tuning, and model scaling up to 11 billion parameters. Evaluation spans GLUE, SuperGLUE, SQuAD, CNN/DailyMail summarization, and WMT translation benchmarks.


Results

  • Encoder–decoder Transformers with denoising objectives outperform decoder-only models.
  • The text-to-text formulation enables a single model to handle classification, generation, and reasoning.
  • Data quality and model scale have a strong, consistent impact on performance.
  • T5 achieves state-of-the-art results on multiple benchmarks at large scale.
  • Pre-training yields substantial gains, especially for low-resource and reasoning-intensive tasks.

Conclusions

A unified text-to-text framework simplifies transfer learning and enables fair comparison across methods while supporting broad generalization. The T5 paradigm demonstrates that architectural consistency, clean large-scale data, and model scaling are more influential than task-specific modeling choices. This work provides a foundational blueprint for general-purpose language models and strongly influences later instruction-tuned and multitask Transformer systems.

Featured Paper

Google Research

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
JMLR · 2020 (arXiv 2019)
Unified all NLP tasks into a single text-to-text format using an encoder–decoder Transformer, enabling systematic comparison of architectures, objectives, datasets, and scale under one framework.

“T5 reframed natural language processing as a single text-to-text problem, proving that unification and scale matter more than task-specific architectures and establishing a blueprint for general-purpose, instruction-driven language models.”

Language Models are Few-Shot Learners

Brown et al. — 2020 — NeurIPS 2020

Abstract

This paper demonstrates that scaling autoregressive language models enables strong few-shot, one-shot, and zero-shot learning without task-specific fine-tuning. By training GPT-3, a 175-billion-parameter Transformer language model, on a large and diverse text corpus, the authors show that a single model can perform a wide range of NLP tasks through natural-language prompting alone. The results indicate that model scale is a primary driver of in-context learning and task generalization.


Problems

  • Modern NLP systems rely heavily on supervised fine-tuning with task-specific datasets.
  • Fine-tuned models often generalize poorly beyond narrow training distributions.
  • Existing meta-learning and zero-shot methods underperform fine-tuned systems.
  • It is unclear whether a single pretrained model can adapt to new tasks using only text-based instructions.

Proposed Solutions

  • Scale Transformer-based language models to unprecedented sizes.
  • Train models purely with a next-token prediction objective.
  • Evaluate models via zero-shot, one-shot, and few-shot prompting instead of gradient updates.
  • Systematically analyze performance scaling across tasks and model sizes.

Purpose

The purpose is to determine whether large language models can perform meta-learning through in-context conditioning, reducing or eliminating the need for supervised fine-tuning while enabling flexible task adaptation.


Methodology

Eight GPT-style Transformer models, ranging from 125 million to 175 billion parameters, are trained on a filtered mixture of Common Crawl, WebText, books, and Wikipedia, totaling approximately 300 billion training tokens. Models use a fixed context window and standard autoregressive training.

Evaluation is conducted across more than two dozen NLP benchmarks. Tasks are specified using natural-language prompts that may include instructions and example demonstrations, with no parameter updates during evaluation. Performance is measured in zero-shot, one-shot, and few-shot regimes.
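
A small sketch of how a few-shot prompt is assembled from demonstrations, following the in-context evaluation protocol described above. The Q/A template and the example pairs are illustrative assumptions, not the paper's exact formats.

```python
def few_shot_prompt(instruction, demos, query):
    """Build a GPT-3-style in-context prompt: an optional instruction, k solved examples,
    then the unsolved query. No gradient updates; the model only conditions on this text."""
    lines = [instruction] if instruction else []
    for x, y in demos:
        lines.append(f"Q: {x}\nA: {y}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

demos = [("Translate 'cheese' to French.", "fromage"),
         ("Translate 'house' to French.", "maison")]
print(few_shot_prompt("Translate English words to French.", demos, "Translate 'water' to French."))
```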


Results

  • Strong few-shot performance across translation, QA, reading comprehension, and reasoning tasks.
  • Few-shot accuracy improves rapidly with model scale.
  • The largest model achieves state-of-the-art results on several benchmarks without fine-tuning.
  • Weaknesses observed in sentence-pair comparison and structured reasoning tasks.
  • Scaling laws show smooth performance gains across zero-, one-, and few-shot settings.

Conclusions

Large language models can function as general-purpose learners by performing meta-learning implicitly through in-context conditioning. GPT-3 shows that task adaptation can emerge from scale alone, without architectural changes or supervised fine-tuning. Despite remaining limitations in reasoning and robustness, this work establishes few-shot prompting as a viable alternative to traditional fine-tuning and lays the groundwork for instruction-following and general-purpose language models.

Featured Paper

OpenAI Research

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al.
NeurIPS 2020 · OpenAI · 2020
Demonstrated that scaling autoregressive Transformer language models enables strong zero-shot, one-shot, and few-shot learning via in-context prompting, without task-specific fine-tuning.

“GPT-3 showed that scale itself is a form of learning: a single language model can adapt to new tasks purely through context, redefining transfer learning and establishing prompting as a core paradigm in modern AI.”

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Clark et al. — 2020 — ICLR 2020

Abstract

This paper introduces ELECTRA, a self-supervised pre-training framework that replaces masked language modeling with a discriminative objective. Instead of predicting masked tokens, ELECTRA trains a discriminator to detect whether each token has been replaced by a generator. Because learning occurs over all tokens rather than a masked subset, ELECTRA significantly improves compute efficiency and yields stronger downstream performance than BERT-style pre-training.


Problems

  • Masked Language Modeling (MLM) learns from only a small fraction of tokens, reducing training efficiency.
  • MLM introduces a pre-train / fine-tune mismatch due to artificial [MASK] tokens.
  • High compute requirements limit accessibility and scalability of strong pre-training methods.
  • Prior approaches emphasize peak accuracy but often under-optimize for efficiency.

Proposed Solutions

  • Replace MLM with a discriminative objective: replaced token detection.
  • Train a small generator to propose plausible token replacements and a discriminator to classify tokens as original or replaced.
  • Apply loss over all tokens in the sequence to maximize learning signal per example.
  • Discard the generator after pre-training and fine-tune only the discriminator for downstream tasks.

Purpose

The purpose is to develop a more compute-efficient pre-training objective that produces high-quality contextual representations, lowering the cost barrier of large-scale NLP models while maintaining or improving downstream task performance.


Methodology

ELECTRA jointly trains two Transformer models: a generator (a small masked language model trained by maximum likelihood) and a discriminator trained to predict, for every token, whether it comes from the original data distribution or was replaced by the generator.

The discriminator uses a token-level binary classification loss applied to all positions in the input. After pre-training, only the discriminator is retained and fine-tuned. Evaluation is performed on GLUE and SQuAD across multiple model sizes with controlled compute budgets for fair comparison.
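
A minimal sketch of the replaced-token-detection loss described above: a per-token binary cross-entropy over every position, with labels marking which tokens the generator replaced. The logits and labels below are invented; in ELECTRA they come from the discriminator and the generator's sampling step.

```python
import numpy as np

def rtd_loss(disc_logits, is_replaced):
    """Replaced-token-detection sketch: binary cross-entropy applied to *every* position,
    so the learning signal covers the full sequence rather than a masked subset."""
    p = 1.0 / (1.0 + np.exp(-disc_logits))                 # sigmoid probability per token
    y = is_replaced.astype(float)
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

logits = np.array([2.1, -1.3, 0.2, -2.5, 1.7])             # one discriminator score per input token
labels = np.array([True, False, False, False, True])       # which tokens the generator swapped
print(rtd_loss(logits, labels))
```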


Results

  • ELECTRA outperforms BERT and other MLM-based models under matched compute and data settings.
  • Small ELECTRA models achieve strong results with limited resources, demonstrating high efficiency.
  • At larger scales, ELECTRA matches or exceeds strong baselines while using substantially less pre-training compute.
  • Gains are most pronounced for smaller models, indicating strong parameter efficiency.
  • Improvements primarily come from learning signal over all tokens rather than masking only a subset.

Conclusions

Training text encoders as discriminators rather than generators yields major efficiency and performance improvements. ELECTRA shows that pre-training objectives should optimize compute efficiency alongside accuracy. By reframing representation learning as token-level discrimination, this work provides a strong alternative to MLM and influences later research on efficient, accessible pre-training.

Featured Paper

Google Research

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
ICLR 2020 · Google Research · 2020
Introduced a discriminative pre-training objective based on replaced token detection, achieving significantly higher compute efficiency and stronger downstream performance than masked language modeling.

“ELECTRA proved that better pre-training objectives can outperform brute-force scale, showing that learning from every token yields superior efficiency and redefining how language representations can be trained under tight compute budgets.”

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al. — 2022 — NeurIPS 2022

Abstract

This paper introduces InstructGPT, a method for aligning large language models with human intent using reinforcement learning from human feedback (RLHF). By fine-tuning GPT-3 with human-written demonstrations and preference rankings, the authors significantly improve instruction-following behavior, truthfulness, and safety. Remarkably, a 1.3B-parameter InstructGPT model is preferred by humans over the original 175B GPT-3, demonstrating that alignment can outweigh sheer model scale.


Problems

  • Standard language-model pretraining optimizes next-token prediction rather than user intent.
  • Large models often generate unhelpful, untruthful, or harmful responses.
  • Prompting alone is unreliable for consistent instruction following.
  • Increasing model scale does not inherently solve alignment or safety issues.

Proposed Solutions

  • Collect human-written demonstrations of desired model behavior.
  • Train a reward model using human preference rankings of model outputs.
  • Fine-tune the language model using PPO-based reinforcement learning from human feedback.
  • Mix reinforcement updates with the pretraining loss to reduce performance regressions.

Purpose

The purpose is to align large language models with human intentions, making them more helpful, honest, and harmless across a wide range of real-world instruction-based tasks.


Methodology

The approach consists of three stages. First, supervised fine-tuning is performed on approximately 13k human-written demonstrations of ideal responses. Second, a reward model is trained on roughly 33k human preference rankings to predict which outputs humans prefer.

Third, the supervised model is optimized using Proximal Policy Optimization (PPO) to maximize the learned reward, with a KL-divergence penalty to prevent over-optimization. Models with 1.3B, 6B, and 175B parameters are evaluated using human preference studies, safety benchmarks, and standard NLP datasets.
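
Two small sketches of the quantities described above: the pairwise reward-model loss, which prefers the human-chosen output over the rejected one, and the KL-shaped reward that PPO maximizes. All values and the coefficient beta are illustrative, and the full PPO machinery (advantages, clipping, value function) is omitted.

```python
import numpy as np

def pairwise_rm_loss(r_chosen, r_rejected):
    # Reward-model objective sketch: -log sigmoid(r_chosen - r_rejected), averaged over pairs.
    return float(np.mean(np.log1p(np.exp(-(r_chosen - r_rejected)))))

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.02):
    # RLHF reward as described above: the learned reward minus a KL penalty that keeps
    # the policy close to the supervised fine-tuned reference model.
    return reward - beta * (logp_policy - logp_ref)

print(pairwise_rm_loss(np.array([1.8, 0.4]), np.array([0.2, 0.9])))
print(kl_shaped_reward(reward=1.2, logp_policy=-34.1, logp_ref=-35.0))
```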


Results

  • Human evaluators strongly prefer InstructGPT outputs over GPT-3 across all model sizes.
  • A 1.3B InstructGPT model outperforms the 175B GPT-3 in human preference evaluations.
  • Hallucinations are reduced and truthfulness improves on TruthfulQA.
  • Toxic output generation decreases when models are instructed to be respectful.
  • Performance regressions on standard benchmarks are largely mitigated by mixing pretraining updates.

Conclusions

Fine-tuning language models with human feedback is a powerful and cost-effective alignment strategy. InstructGPT shows that alignment quality can dominate model scale, enabling smaller models to outperform much larger unaligned ones. This work establishes RLHF as a foundational technique for instruction-following models and directly shapes modern conversational AI systems.

Featured Paper

OpenAI Research

Training Language Models to Follow Instructions with Human Feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, et al.
NeurIPS 2022 · OpenAI · 2022
Introduced Reinforcement Learning from Human Feedback (RLHF) as a scalable alignment method, enabling language models to better follow instructions, improve truthfulness, and reduce harmful behavior.

“InstructGPT marked the transition from raw language modeling to aligned AI, proving that human feedback—not just scale—can dramatically improve usefulness, safety, and trustworthiness, even allowing smaller models to outperform much larger ones.”

PaLM: Scaling Language Modeling with Pathways

Chowdhery et al. — 2022 — arXiv

Abstract

This paper introduces PaLM (Pathways Language Model), a 540-billion-parameter dense Transformer trained using the Pathways system, which enables efficient scaling across thousands of TPU v4 chips. PaLM shows that continued scaling of autoregressive language models yields substantial gains in few-shot learning across language understanding, reasoning, multilingual tasks, and code generation. The model achieves state-of-the-art results on hundreds of benchmarks and exceeds average human performance on BIG-bench, revealing emergent reasoning capabilities at large scale.


Problems

  • Strong performance often requires task-specific fine-tuning of pretrained language models.
  • Few-shot learning was constrained by prior model sizes and training infrastructure.
  • Efficiently scaling dense models across large accelerator clusters posed major systems challenges.
  • It was unclear whether language-model scaling had reached diminishing returns, especially for reasoning.

Proposed Solutions

  • Train an extremely large dense autoregressive Transformer with 540B parameters.
  • Introduce Pathways, a distributed ML system enabling pipeline-free training across TPU pods.
  • Apply architectural optimizations such as multi-query attention, SwiGLU, parallel layers, and RoPE embeddings.
  • Evaluate primarily in few-shot settings to measure generalization from scale alone.

Purpose

The purpose is to systematically investigate the effects of extreme model scale on few-shot learning and to determine whether new reasoning and generalization capabilities emerge beyond previously explored limits.


Methodology

PaLM is a decoder-only Transformer trained on approximately 780 billion tokens from a diverse, high-quality corpus including web pages, books, code, and multilingual data. Three model sizes (8B, 62B, and 540B parameters) are trained under identical conditions.

Training is performed on 6144 TPU v4 chips using the Pathways system with two-pod data parallelism and no pipeline parallelism, enabling high hardware utilization. Evaluation focuses on few-shot performance across hundreds of benchmarks, including BIG-bench, reasoning datasets, multilingual translation, and code generation, along with analyses of scaling behavior, bias, and toxicity.
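
Among the architectural choices listed under Proposed Solutions, PaLM's "parallel" layer formulation computes the attention and feed-forward sub-layers from the same normalized input and sums both into one residual, instead of applying them serially. The toy sketch below uses placeholder callables standing in for real sub-layers.

```python
def parallel_block(x, attn, mlp, norm):
    """PaLM-style parallel layer sketch: attention and MLP read the same normalized input
    and their outputs are added into a single residual stream."""
    h = norm(x)
    return x + attn(h) + mlp(h)

def serial_block(x, attn, mlp, norm):
    # Standard pre-norm formulation, shown for comparison.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

# Toy usage with trivial stand-in functions (no real attention or MLP here).
out = parallel_block(1.0, attn=lambda h: 0.1 * h, mlp=lambda h: 0.2 * h, norm=lambda h: h)
print(out)  # 1.3
```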


Results

  • PaLM 540B achieves state-of-the-art few-shot performance on most English NLP benchmarks.
  • The model outperforms prior systems on BIG-bench and exceeds average human performance on many tasks.
  • Large gains are observed in multi-step reasoning, especially with chain-of-thought prompting.
  • Some capabilities emerge discontinuously, appearing only at very large scale.
  • Strong multilingual and code-generation performance is achieved despite predominantly English training data.
  • Scaling improves some bias metrics, though toxicity and harmful stereotypes remain concerns.

Conclusions

The PaLM study provides compelling evidence that language-model scaling has not saturated and that new cognitive-like behaviors emerge at sufficient scale. Advances in systems design and architectural efficiency make training dense models at unprecedented size feasible. PaLM establishes large autoregressive Transformers as powerful few-shot learners and sets the stage for future work on emergent reasoning, alignment, and responsible deployment of large language models.

Featured Paper

Google Research / Google DeepMind

PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al.
arXiv · Google Research · 2022
Introduced a 540B-parameter dense Transformer trained with the Pathways system, demonstrating that extreme scale enables emergent reasoning, multilingual competence, and state-of-the-art few-shot learning without task-specific fine-tuning.

“PaLM provided decisive evidence that language-model scaling had not saturated, revealing emergent reasoning abilities at sufficient scale and establishing few-shot prompting as a reliable measure of general intelligence in language models.”

LLaMA: Open and Efficient Foundation Language Models

Touvron et al. — 2023 — Meta AI — arXiv

Abstract

This paper introduces LLaMA, a family of foundation language models ranging from 7B to 65B parameters, trained exclusively on publicly available data. The work demonstrates that smaller, well-trained models can match or surpass the performance of much larger proprietary systems. Notably, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, while LLaMA-65B competes with Chinchilla-70B and PaLM-540B, establishing a new efficiency frontier in large language modeling.


Problems

  • State-of-the-art language models rely on proprietary datasets, limiting openness and reproducibility.
  • Scaling parameters alone leads to high inference costs and limited deployability.
  • Existing scaling strategies often optimize training compute but ignore inference efficiency.
  • Open models historically lagged behind closed models in performance.

Proposed Solutions

  • Train compute-efficient dense Transformers at moderate scale (7B–65B parameters).
  • Use substantially more tokens per parameter following compute-optimal scaling laws.
  • Rely exclusively on public datasets to enable open release and reproducibility.
  • Adopt architectural refinements such as RMSNorm, SwiGLU, and RoPE embeddings.

Purpose

The purpose is to demonstrate that high-performance foundation models can be trained efficiently and openly, achieving competitive or superior results without extreme parameter counts or proprietary data, while optimizing for real-world inference budgets.


Methodology

LLaMA models are decoder-only Transformers trained on up to 1.4 trillion tokens from a curated mixture of public sources, including CommonCrawl, C4, GitHub, Wikipedia, books, arXiv, and StackExchange. Model sizes include 7B, 13B, 33B, and 65B parameters.

Training uses the AdamW optimizer, cosine learning-rate scheduling, large batch sizes, and architectural choices inspired by GPT-3 and PaLM, such as pre-normalization, SwiGLU activations, and rotary positional embeddings. Evaluation is conducted in zero-shot and few-shot settings across commonsense reasoning, question answering, reading comprehension, mathematics, code generation, and MMLU benchmarks.
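
The rotary positional embedding (RoPE) mentioned above injects position by rotating pairs of feature channels through position-dependent angles, so attention scores depend on relative offsets. A minimal NumPy sketch with illustrative shapes, assuming the common interleaved-pair convention:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding sketch. x: (seq_len, d) with d even; positions: (seq_len,).
    Each consecutive pair of channels is rotated by an angle that grows with position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)              # one frequency per channel pair
    angles = positions[:, None] * freqs[None, :]           # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                     # 2D rotation of each (x1, x2) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(2).normal(size=(4, 8))
print(rope(q, np.arange(4)).shape)  # (4, 8)
```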


Results

  • LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being over 10× smaller.
  • LLaMA-65B matches or exceeds Chinchilla-70B and rivals PaLM-540B on many tasks.
  • Strong performance is observed in commonsense reasoning, QA, math, and code generation.
  • Light instruction fine-tuning yields large gains, with LLaMA-I outperforming several instruction-tuned baselines.
  • Bias and toxicity levels are comparable to or slightly improved over GPT-3 and OPT.
  • Training cost and carbon footprint are significantly lower than comparable large models.

Conclusions

LLaMA demonstrates that scaling data and training duration can outperform brute-force parameter scaling. The work proves that open, efficient foundation models can achieve state-of-the-art performance without proprietary data or extreme model sizes. LLaMA represents a pivotal shift toward accessible, reproducible, and inference-efficient language models, laying the foundation for future open research and large-scale deployment.

Featured Paper

Meta AI Research

LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al.
arXiv · Meta AI · 2023
Demonstrated that smaller, efficiently trained language models using more data per parameter can match or outperform much larger proprietary models, redefining compute-optimal scaling and establishing a new era of high-performance open foundation models.

“LLaMA proved that openness and efficiency can rival brute-force scale, showing that carefully trained models with public data can match the strongest closed systems and catalyzing the modern open-source LLM ecosystem.”

GPT-4 Technical Report

OpenAI — 2023 — Technical Report

Abstract

This report presents GPT-4, a large-scale multimodal Transformer-based language model capable of processing both text and image inputs while generating text outputs. GPT-4 demonstrates human-level performance on a wide range of academic and professional benchmarks, substantially outperforming GPT-3.5. A central contribution is the introduction of predictable scaling infrastructure, enabling reliable performance forecasting from smaller models. The work also places strong emphasis on alignment, safety, and risk mitigation, outlining both advances and remaining limitations.


Problems

  • Scaling large language models involves unpredictable training dynamics and high financial risk.
  • Earlier models exhibited unreliable reasoning, hallucinations, and weak factual grounding.
  • Increased capability raises safety and misuse risks, including bias and harmful content generation.
  • Existing benchmarks inadequately measured real-world, human-level competence.

Proposed Solutions

  • Develop predictable scaling laws and infrastructure to extrapolate performance from smaller models.
  • Train a multimodal Transformer capable of joint reasoning over text and images.
  • Apply extensive post-training alignment using Reinforcement Learning from Human Feedback (RLHF).
  • Deploy a model-assisted safety pipeline with expert red-teaming and rule-based reward models.

Purpose

The purpose is to characterize the capabilities, limitations, and safety properties of GPT-4, while demonstrating that large, general-purpose AI systems can be trained reliably, predictably, and safely at scale.


Methodology

GPT-4 is a Transformer-based autoregressive model pretrained on a mixture of publicly available and licensed data using next-token prediction, followed by extensive RLHF alignment. The training process incorporates predictable scaling, allowing loss and downstream performance to be forecast from models trained with 1,000–10,000× less compute.
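
A toy sketch of the forecasting idea: fit a power law to the final losses of small training runs in log-log space, then extrapolate to a much larger compute budget. All numbers below are invented for illustration and bear no relation to GPT-4's actual runs.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Predictable-scaling sketch: fit loss ~ a * compute^b by linear regression in
    log-log space (irreducible-loss terms are omitted for simplicity)."""
    b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(log_a), b

compute = np.array([1e18, 1e19, 1e20, 1e21])      # hypothetical training-compute budgets (FLOPs)
loss    = np.array([3.10, 2.71, 2.37, 2.07])      # hypothetical final losses of small runs
a, b = fit_power_law(compute, loss)
predicted = a * (1e24) ** b                        # forecast for a run with ~1000x more compute
print(round(float(predicted), 2))
```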

Evaluation spans academic exams, professional licensing tests, standard NLP benchmarks, multilingual tasks, and code-generation benchmarks. The model supports multimodal inputs with interleaved text and images. Safety evaluation combines expert adversarial testing, RLHF, and rule-based reward modeling.


Results

  • GPT-4 achieves top-10% human performance on a simulated Uniform Bar Exam.
  • Outperforms prior models on MMLU, HumanEval, GSM-8K, HellaSwag, and ARC.
  • Demonstrates strong multilingual performance, often surpassing English SOTA in translated benchmarks.
  • Significantly reduced hallucinations and improved factuality compared to GPT-3.5.
  • Safety interventions reduce incorrect behavior on disallowed content by over 80%.
  • Predictable-scaling infrastructure accurately forecasts final training loss and some downstream capabilities from much smaller runs.

Conclusions

GPT-4 marks a major advance in general-purpose, multimodal language modeling. The work shows that predictable scaling and rigorous alignment can yield human-level performance across diverse domains. While meaningful progress is made in reliability and safety, limitations such as hallucinations, calibration gaps, and residual risks remain. This report establishes a foundation for scalable, forecastable, and safer AI systems, while underscoring the need for continued research in robustness, alignment, and societal impact.

Featured Paper

OpenAI Research

GPT-4 Technical Report
OpenAI
OpenAI Technical Report · 2023
Introduced a large-scale multimodal Transformer capable of reasoning over text and images, achieving human-level performance on professional and academic benchmarks and establishing predictable scaling as a core engineering principle.

“GPT-4 marked the transition from powerful language models to general-purpose, multimodal reasoning systems, demonstrating that capability, predictability, and alignment can be jointly engineered at scale.”

Mistral 7B

Jiang et al. — 2023 — arXiv — Apache 2.0 License

Abstract

This paper introduces Mistral 7B, a 7-billion-parameter Transformer language model designed to deliver state-of-the-art performance with high inference efficiency. Despite its compact size, Mistral 7B outperforms all previously released open 13B models and surpasses some 34B models on reasoning, mathematics, and code generation benchmarks. The model combines grouped-query attention and sliding window attention to significantly reduce inference cost while preserving strong capabilities. An instruction-tuned variant further demonstrates competitive conversational performance.


Problems

  • State-of-the-art language modeling has relied heavily on scaling parameter counts, increasing inference cost.
  • Open models historically lagged behind closed or much larger systems in performance.
  • Standard self-attention scales quadratically with sequence length, limiting long-context usage.
  • Efficiency-oriented models often sacrifice reasoning, mathematics, or code performance.

Proposed Solutions

  • Design a compact yet high-capacity Transformer optimized for inference efficiency.
  • Introduce grouped-query attention (GQA) to reduce key-value memory and accelerate decoding.
  • Apply sliding window attention (SWA) to enable long effective context with reduced computation.
  • Optimize KV-cache usage through a rolling buffer mechanism.
  • Demonstrate that reasoning and knowledge can be compressed more efficiently than assumed.

Purpose

The purpose is to show that strong reasoning and general-purpose language understanding do not require large parameter counts, and that careful architectural design can produce models that are both highly capable and deployable at scale.


Methodology

Mistral 7B is a decoder-only Transformer with 32 layers and a hidden dimension of 4096. Key architectural elements include grouped-query attention, which uses fewer key-value heads than attention heads, and sliding window attention with a window size of 4096 tokens.
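
A minimal sketch of grouped-query attention as described above: several query heads share each key/value head, shrinking the KV cache proportionally (with a single KV head this reduces to multi-query attention). Shapes and the loop-based implementation are purely illustrative.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """GQA sketch. Shapes: q (n_heads, T, d); k, v (n_kv_heads, T, d).
    Each query head attends using the shared K/V of its group."""
    n_heads, T, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]               # shared K/V for this head's group
        scores = q[h] @ kh.T / np.sqrt(d)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[h] = w @ vh
    return out

rng = np.random.default_rng(4)
q = rng.normal(size=(8, 5, 16))       # 8 query heads
k = v = rng.normal(size=(2, 5, 16))   # only 2 KV heads: 4 query heads per KV head
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 5, 16)
```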

A rolling buffer KV-cache limits memory growth during long-sequence inference, enabling an effective receptive field exceeding 100k tokens across layers. The model is pretrained autoregressively on large-scale text data and evaluated across commonsense reasoning, world knowledge, reading comprehension, mathematics, code generation, and aggregated benchmarks such as MMLU and BBH. Public instruction fine-tuning yields Mistral 7B-Instruct.
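
The rolling-buffer idea can be sketched as a fixed-size cache indexed by position modulo the window, so memory stays bounded by the window size regardless of sequence length. The class below is an illustrative toy, not Mistral's implementation.

```python
class RollingKVCache:
    """Rolling-buffer KV-cache sketch: with window size W, only the last W keys/values
    are retained, so cache memory is O(W) rather than O(sequence length)."""
    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.t = 0

    def append(self, k, v):
        self.keys[self.t % self.window] = k      # overwrite the slot for position t mod W
        self.values[self.t % self.window] = v
        self.t += 1

    def visible(self):
        # Keys/values the current token may attend to (at most the last `window` positions).
        n = min(self.t, self.window)
        order = [(self.t - n + i) % self.window for i in range(n)]
        return [self.keys[i] for i in order], [self.values[i] for i in order]

cache = RollingKVCache(window=4)
for t in range(6):
    cache.append(k=f"k{t}", v=f"v{t}")
ks, _ = cache.visible()
print(ks)  # ['k2', 'k3', 'k4', 'k5'] -- only the most recent 4 positions remain cached
```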


Results

  • Mistral 7B outperforms LLaMA 2 13B across all evaluated benchmarks.
  • Surpasses LLaMA 1 34B on mathematics, reasoning, and code generation tasks.
  • Achieves MMLU and reasoning performance comparable to models more than 3× its size.
  • Mistral 7B-Instruct matches or exceeds several 13B chat models in evaluations.
  • Sliding window attention provides up to 2× inference speedup and up to 8× KV-cache memory reduction.
  • Safety experiments show high refusal accuracy on unsafe prompts via system prompting.

Conclusions

Mistral 7B demonstrates that architectural efficiency is a critical axis alongside training cost and model size. The results challenge the assumption that strong reasoning and code capabilities require massive parameter counts. By combining efficient attention mechanisms with careful engineering, Mistral 7B establishes a new baseline for high-performance, open, and deployable language models, strongly influencing subsequent work on compact and modular LLM architectures.

Featured Paper

Mistral AI

Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, et al.
arXiv · Mistral AI · 2023
Introduced a compact 7B-parameter Transformer that achieves state-of-the-art performance among open models at its size by combining grouped-query attention and sliding window attention for dramatically improved inference efficiency.

“Mistral 7B shattered the assumption that strong reasoning and code capabilities require large models, proving that architectural efficiency is as powerful as scale and redefining what ‘small’ models can achieve.”

LLaMA 2: Open Foundation and Fine-Tuned Chat Models

Touvron et al. — 2023 — Meta AI — arXiv

Abstract

This paper introduces LLaMA 2, a family of open foundation language models ranging from 7B to 70B parameters, alongside LLaMA 2-Chat, instruction-tuned variants optimized for dialogue. Trained on substantially more data than LLaMA 1 and released under a commercially permissive license, LLaMA 2 achieves competitive performance with leading closed models while emphasizing safety, transparency, and accessibility. The work establishes a strong open alternative for both research and deployment.


Problems

  • High-performing language models are predominantly closed and proprietary.
  • Earlier open models lacked instruction-following and conversational alignment.
  • Safety and alignment properties were insufficiently explored in open releases.
  • Scaling open models to competitive performance while maintaining efficiency was challenging.

Proposed Solutions

  • Train larger and better-optimized open models up to 70B parameters.
  • Introduce LLaMA 2-Chat, fine-tuned specifically for dialogue and instruction following.
  • Apply systematic safety evaluations and red-teaming prior to release.
  • Release models under a commercial-friendly license to encourage adoption.

Purpose

The purpose is to demonstrate that open and responsibly released language models can achieve performance comparable to proprietary systems, while supporting transparent research, commercial use, and safety analysis.


Methodology

LLaMA 2 models are decoder-only Transformers trained on approximately 2 trillion tokens, roughly 40% more pretraining data than LLaMA 1. Model sizes include 7B, 13B, 34B, and 70B parameters, pretrained on a curated mixture of public text sources.

Instruction tuning is performed through a combination of supervised fine-tuning using human-written prompts and responses, followed by reinforcement learning from human feedback (RLHF) to optimize helpfulness and safety. Evaluation spans reasoning, coding, knowledge, and dialogue benchmarks, alongside safety testing using red-team adversarial prompts and automated metrics.


Results

  • LLaMA 2-70B approaches GPT-3.5-level performance on many benchmarks, though a gap remains on coding tasks.
  • LLaMA 2-Chat models are strongly preferred over LLaMA 1 and other open chat models.
  • Instruction tuning substantially improves helpfulness, coherence, and safety.
  • Safety evaluations show significant reductions in harmful and toxic outputs.
  • Performance scales predictably with model size, confirming effective training improvements.

Conclusions

LLaMA 2 demonstrates that open, commercially usable language models can reach competitive performance with proprietary systems when trained at sufficient scale and aligned using human feedback. By combining strong base models, instruction tuning, and rigorous safety evaluation, this work establishes a new standard for open LLM releases and accelerates open research, enterprise adoption, and the broader language model ecosystem.

Featured Paper

Meta AI Research

LLaMA 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, et al.
arXiv · Meta AI · 2023
Introduced a new generation of open foundation models (7B–70B) and instruction-tuned chat variants trained with supervised fine-tuning and RLHF, delivering competitive performance with leading proprietary systems under a commercially permissive license.

“LLaMA 2 established that open, responsibly released language models can be both competitive and safe, proving that alignment, scalability, and commercial usability need not be exclusive to closed systems.”

Mixtral of Experts

Jiang et al. — 2024 — arXiv

Abstract

This paper introduces Mixtral 8×7B, a Sparse Mixture-of-Experts (SMoE) decoder-only Transformer that achieves frontier-level performance among open-weight models with substantially lower inference cost. Although the model contains 47B total parameters, only 13B parameters are active per token via sparse expert routing. Mixtral matches or surpasses LLaMA 2 70B and GPT-3.5 across most benchmarks, with particularly strong gains in mathematics, code generation, multilingual tasks, and long-context reasoning. An instruction-tuned variant further excels in human evaluations.


Problems

  • Dense scaling leads to high inference cost and latency, limiting deployability.
  • Open models lagged behind closed systems in reasoning, code, and multilingual performance.
  • Standard Transformers scale computation inefficiently with parameters and context length.
  • Prior MoE approaches had not achieved state-of-the-art performance in open models.

Proposed Solutions

  • Replace dense feed-forward layers with Sparse Mixture-of-Experts (MoE) layers.
  • Use a router network to select the top-2 experts per token at each layer.
  • Increase total parameter capacity without increasing per-token compute.
  • Pretrain with a 32k token context window for long-range reasoning.
  • Apply instruction fine-tuning using SFT and Direct Preference Optimization (DPO).

Purpose

The purpose is to demonstrate that conditional computation via sparse expert routing can deliver frontier-level language modeling performance while dramatically reducing inference cost, making high-capability LLMs more efficient and accessible.


Methodology

Mixtral 8×7B is a decoder-only Transformer with 32 layers, where each feed-forward block is replaced by an MoE layer containing 8 experts. For every token at each layer, a gating network selects 2 experts, resulting in 13B active parameters per token out of 47B total parameters.
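
A per-token sketch of the routing described above: a linear router scores the eight experts, the top two are selected, their outputs are mixed with softmax-renormalized gate weights, and the remaining six experts are never evaluated. Expert and router weights below are random placeholders.

```python
import numpy as np

def moe_layer(x, router_W, experts, k=2):
    """Sparse MoE sketch for one token: score all experts, keep the top-k, and combine
    their outputs with gate weights normalized over the selected experts only."""
    scores = x @ router_W                                   # (num_experts,)
    top = np.argsort(scores)[-k:]                           # indices of the k best experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                                      # softmax over the selected experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))

rng = np.random.default_rng(3)
d, n_experts = 8, 8
experts = [(lambda W: (lambda h: h @ W))(rng.normal(0, 0.1, (d, d))) for _ in range(n_experts)]
router_W = rng.normal(0, 0.1, (d, n_experts))
token = rng.normal(size=d)
print(moe_layer(token, router_W, experts).shape)  # (8,) -- only 2 of the 8 experts ran
```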

The model is pretrained on large-scale multilingual data with a 32k context length. Evaluation covers commonsense reasoning, world knowledge, reading comprehension, mathematics, code generation, multilingual benchmarks, and long-context retrieval. An instruction-tuned variant is trained with supervised fine-tuning followed by DPO. Routing behavior and expert specialization are analyzed to assess efficiency and locality.
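
For the preference-tuning stage, a minimal sketch of the DPO objective: increase the policy's log-probability margin for the preferred answer over the rejected one, measured relative to a frozen reference model, with no separate reward model. The log-probabilities below are invented.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization sketch: -log sigmoid of the beta-scaled margin
    between the chosen (w) and rejected (l) responses, relative to the reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.mean(np.log1p(np.exp(-margin))))        # -log(sigmoid(margin))

# Toy sequence log-probabilities: the policy already slightly prefers the chosen answer.
print(dpo_loss(logp_w=np.array([-12.0]), logp_l=np.array([-15.0]),
               ref_logp_w=np.array([-13.0]), ref_logp_l=np.array([-14.5])))
```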


Results

  • Matches or outperforms LLaMA 2 70B and GPT-3.5 using 5× fewer active parameters.
  • Substantial gains in mathematics and code generation over larger dense models.
  • Multilingual performance exceeds LLaMA 2 70B in several major languages.
  • Near-perfect long-context retrieval accuracy across 32k token contexts.
  • Instruction-tuned model outperforms GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and LLaMA 2-Chat in human evaluations.
  • Bias analyses indicate reduced bias and more balanced sentiment than LLaMA 2 70B.

Conclusions

Mixtral establishes sparse Mixture-of-Experts models as a practical alternative to dense scaling. By activating only a small subset of parameters per token, Mixtral achieves frontier-level performance with dramatically improved efficiency. The results show that capacity, not density, drives advanced reasoning, positioning sparse architectures as a central direction for scalable, open, and deployable language models.

Featured Paper

Mistral AI

Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, et al.
arXiv · Mistral AI · 2024
Introduced a sparse Mixture-of-Experts (MoE) decoder-only Transformer (8×7B) that activates only a subset of parameters per token, achieving frontier-level performance with dramatically reduced inference cost and strong gains in math, code, multilingual tasks, and long-context reasoning.

“Mixtral demonstrated that conditional computation beats dense scaling: by activating capacity only when needed, sparse experts deliver state-of-the-art performance with far greater efficiency, redefining how large language models should scale.”

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team — Google DeepMind — 2024

Abstract

This paper introduces Gemma, a family of lightweight open-weight language models derived from the research and training recipes behind Google DeepMind’s Gemini models. Released in 2B and 7B parameter sizes, Gemma achieves state-of-the-art performance among open models at comparable scale across language understanding, reasoning, mathematics, coding, and safety benchmarks. The work emphasizes responsible open release, combining strong capability with extensive safety evaluation, low memorization rates, and transparent reporting.


Problems

  • Small and medium-scale open models historically lag behind closed systems in reasoning, math, and code.
  • Many open releases lack rigorous safety, bias, and memorization evaluations.
  • Scaling models increases compute cost and environmental impact, limiting accessibility.
  • Instruction tuning often trades off helpfulness, safety, and factuality.

Proposed Solutions

  • Train compact but high-capacity Transformer decoders (2B and 7B) using Gemini-inspired recipes.
  • Use large-scale pretraining (up to 6T tokens) with careful data filtering and staging.
  • Release both base and instruction-tuned checkpoints for flexibility and research.
  • Apply systematic safety, bias, and memorization evaluations before release.
  • Adopt a responsible open-model framework with detailed documentation and guidance.

Purpose

The purpose of this work is to demonstrate that small, openly available language models can achieve strong reasoning, coding, and instruction-following performance while remaining efficient, safe, and responsibly released, lowering barriers to research, deployment, and innovation.


Methodology

Gemma models are decoder-only Transformer architectures with an 8k context window, trained primarily on English text, code, and mathematics data. Two model sizes are released: 2B and 7B parameters.

Key architectural elements include rotary positional embeddings (RoPE), GeGLU activations, RMSNorm, and multi-query attention (for the 2B model). Training is conducted on TPUv5e pods using Pathways-style distributed infrastructure. Instruction tuning is performed via supervised fine-tuning followed by reinforcement learning from human feedback (RLHF). Evaluation spans academic benchmarks, human preference studies, safety metrics, and memorization tests.


Results

  • Gemma 7B outperforms similarly sized open models such as LLaMA 2 7B and Mistral 7B.
  • Strong gains in mathematics benchmarks (GSM8K, MATH) and code generation (HumanEval, MBPP).
  • Instruction-tuned Gemma achieves higher human preference win rates than Mistral 7B Instruct.
  • Low memorization rates comparable to PaLM and PaLM 2, with no sensitive data leakage detected.
  • Improved robustness against toxicity, bias, and harmful outputs relative to comparable open models.
  • Training conducted with a moderate carbon footprint using carbon-neutral data centers.

Conclusions

Gemma demonstrates that carefully trained, lightweight open language models can deliver strong reasoning, coding, and instruction-following capabilities without large parameter counts. By combining Gemini-derived training practices with rigorous safety and memorization analysis, this work sets a new standard for responsible open LLM releases and provides a strong foundation for efficient, transparent, and accessible language modeling.

Featured Paper

Google DeepMind

Gemma: Open Models Based on Gemini Research and Technology
Gemma Team — Google DeepMind
Technical Report · 2024
Introduced lightweight open-weight language models (2B and 7B) derived from Gemini research, achieving state-of-the-art performance at their scale while emphasizing safety, low memorization, and responsible open release.

“Gemma showed that openness, safety, and strong reasoning are not mutually exclusive, setting a new standard for small, responsible language models inspired by frontier research.”

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model

Ben Allal et al. — Hugging Face — 2025

Abstract

This paper introduces SmolLM2, a family of small language models (135M, 360M, and 1.7B parameters) that achieve state-of-the-art performance within their size class by prioritizing data quality, dataset curation, and multi-stage training over architectural scaling. By training the 1.7B model on approximately 11 trillion tokens of carefully curated web, math, code, and instruction-following data—much of it newly created—the authors show that data-centric optimization can compensate for limited model capacity. SmolLM2 outperforms comparable small models such as Qwen2.5-1.5B and LLaMA-3.2-1B across reasoning, knowledge, and instruction-following benchmarks.


Problems

  • Small language models are highly sensitive to noisy or low-quality training data.
  • Public datasets for math, code, and instruction following are often poorly filtered or misaligned.
  • Single-stage pretraining mixtures fail to optimally allocate limited model capacity.
  • Small models underperform on reasoning-heavy benchmarks despite long training runs.

Proposed Solutions

  • Adopt a data-centric training paradigm focused on quality, relevance, and curriculum design.
  • Introduce new high-quality datasets: FineMath (math reasoning), Stack-Edu (educational code), and SmolTalk (instructions).
  • Use multi-stage pretraining with dynamic dataset rebalancing based on observed weaknesses.
  • Apply late-stage upsampling of math and code data to maximize their impact.
  • Release models, datasets, and tooling to ensure full reproducibility.

Purpose

The purpose of this work is to demonstrate that small language models can achieve state-of-the-art performance through careful data design and training strategy, enabling efficient deployment in resource-constrained environments such as edge devices and mobile systems.


Methodology

SmolLM2 models use a decoder-only Transformer architecture inspired by LLaMA. The flagship 1.7B-parameter model is trained on approximately 11 trillion tokens using a four-stage curriculum.

Stage 1 focuses on web-heavy training using a FineWeb-Edu and DCLM mixture to build broad knowledge. Stage 2 increases the proportion of code data and introduces early math data. Stage 3 replaces raw code with Stack-Edu and upgrades math data quality. Stage 4 anneals training with heavy upsampling of FineMath and code while decaying the learning rate. Post-training includes supervised fine-tuning on SmolTalk and preference alignment using Direct Preference Optimization (DPO). Context length is extended from 2k to 8k tokens late in training without loss in performance.
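
To make the staging concrete, the sketch below expresses the curriculum as a set of per-stage sampling mixtures. The stage names mirror the description above, but the corpus weights are purely illustrative placeholders, not the ratios reported in the paper.

```python
import random

# Illustrative multi-stage pretraining mixtures. The stage names follow the
# SmolLM2 curriculum described above, but the weights are hypothetical and
# serve only to show how a staged, rebalanced data schedule can be expressed.
STAGES = {
    "stage1_web_heavy": {"fineweb_edu": 0.6, "dclm": 0.3, "code": 0.1},
    "stage2_add_math":  {"fineweb_edu": 0.5, "dclm": 0.2, "code": 0.2, "math": 0.1},
    "stage3_stack_edu": {"fineweb_edu": 0.4, "dclm": 0.2, "stack_edu": 0.25, "math": 0.15},
    "stage4_anneal":    {"fineweb_edu": 0.3, "stack_edu": 0.3, "finemath": 0.4},
}

def sample_source(stage: str, rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*STAGES[stage].items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source("stage4_anneal", rng) for _ in range(5)])
```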


Results

  • SmolLM2-1.7B outperforms Qwen2.5-1.5B and LLaMA-3.2-1B on most reasoning and knowledge benchmarks.
  • Large gains in mathematics and code generation follow the introduction of FineMath and Stack-Edu.
  • Instruction-tuned SmolLM2 achieves strong dialogue and instruction-following performance on IFEval.
  • Smaller variants (360M and 135M) are state-of-the-art within their parameter ranges.
  • Long-context performance remains stable after extending to an 8k token window.
  • Results show that data quality and training duration outweigh parameter scaling for small models.

Conclusions

SmolLM2 demonstrates that small language models can achieve strong generalization, reasoning, and instruction-following capabilities through data-centric optimization. By shifting focus from architectural scaling to dataset quality, curation, and curriculum design, this work establishes a clear blueprint for efficient, open, and deployable language models and reinforces data-centric training as a core paradigm for the next generation of small and edge-friendly LMs.

Featured Paper

Hugging Face Research

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, et al.
arXiv · Hugging Face Research · 2025
Demonstrated that small language models (135M–1.7B) can achieve state-of-the-art performance through data-centric optimization, multi-stage curricula, and high-quality math, code, and instruction datasets—without architectural scaling.

“SmolLM2 proved that data—not parameters—is the dominant lever for small models, establishing data-centric training as a blueprint for efficient, deployable, and open language models in the post-scale era.”

Architectural Evolution of Transformer-Based Language Models


1. GPT (2018)

  • Decoder-only Transformer
  • Causal self-attention (left-to-right)
  • Learned positional embeddings
  • Standard feed-forward network with ReLU
  • Pre-norm not yet standard
  • Established the autoregressive text generation paradigm

2. BERT (2018)

  • Encoder-only Transformer
  • Fully bidirectional self-attention
  • Masked Language Modeling enables bidirectional context
  • Segment embeddings for sentence-pair tasks
  • Optimized for representation learning rather than generation

3. GPT-2 (2019)

  • Scaled decoder-only Transformer
  • Increased depth and width
  • Byte-level BPE tokenizer
  • Improved layer normalization placement
  • Demonstrated emergent abilities driven by scale

4. T5 (2019)

  • Encoder–decoder Transformer
  • Unified text-to-text architecture
  • Shared embeddings between encoder and decoder
  • Relative positional encodings
  • Strong performance on multi-task and sequence transformation problems

5. GPT-3 (2020)

  • Very large decoder-only Transformer
  • No architectural novelty; extreme scaling
  • Fixed context window
  • In-context learning emerges from architecture and scale alone
  • Validated architectural stability under massive scaling

6. ELECTRA (2020)

  • Encoder-only Transformer
  • Architecture largely identical to BERT
  • Primary innovation is the training objective, not structure
  • Efficient representation learning with the same backbone

7. InstructGPT (2022)

  • Same decoder-only architecture as GPT-3
  • No architectural modifications
  • Alignment achieved entirely via post-training
  • Demonstrates architecture is alignment-agnostic

8. PaLM (2022)

  • Large decoder-only Transformer
  • Adopted multi-query attention at scale (a single shared KV head, multiple query heads)
  • SwiGLU feed-forward networks
  • Rotary positional embeddings (RoPE)
  • Optimized for large-scale parallel training
  • Improved compute and memory efficiency

9. LLaMA (2023)

  • Decoder-only Transformer refined for efficiency
  • RMSNorm replaces LayerNorm
  • SwiGLU activations
  • Rotary positional embeddings (RoPE)
  • Pre-norm applied consistently
  • Trained with more tokens per parameter
  • Demonstrated that efficient design outperforms brute-force scale

10. GPT-4 (2023)

  • Decoder-only Transformer (details undisclosed)
  • Multimodal input support (text and images)
  • Architecture remains stable
  • Innovation centers on predictable scaling and alignment
  • Confirms maturity of the Transformer architecture

11. Mistral 7B (2023)

  • Decoder-only Transformer
  • Grouped-Query Attention (GQA)
  • Sliding Window Attention (SWA)
  • Rolling KV cache for long-context efficiency
  • Strong focus on inference-time optimization

12. LLaMA 2 (2023)

  • Same core architecture as LLaMA
  • Minor refinements for training stability
  • Architecture unchanged
  • Performance gains driven by data scale and alignment
  • Confirms architectural saturation

13. Mixtral 8×7B (2024)

  • Sparse Mixture-of-Experts Transformer
  • Dense self-attention with MoE feed-forward layers
  • Top-2 expert routing per token
  • Large total capacity with low active compute
  • Introduced conditional computation at scale

14. Gemma (2024)

  • Decoder-only Transformer derived from Gemini research
  • Compact, efficiency-focused design
  • Rotary positional embeddings (RoPE)
  • RMSNorm
  • GeGLU / SwiGLU activations
  • Multi-query attention in smaller variants
  • Optimized for small-model performance and safety

15. SmolLM2 (2024)

  • Decoder-only Transformer (LLaMA-style)
  • Architecture unchanged
  • Context length extended late in training
  • Performance driven by data scheduling and curriculum
  • Confirms data dominates architecture for small models

Macro-Level Architectural Trends

  • 2018–2019: Establishment of encoder-only, decoder-only, and encoder–decoder forms
  • 2020–2021: Validation that the base Transformer scales reliably
  • 2022: Efficiency-driven refinements (RoPE, SwiGLU, multi-query attention)
  • 2023: Inference optimization (GQA, sliding window attention)
  • 2024: Conditional computation (MoE) and small-model efficiency
  • Current state: Architecture largely converged; innovation shifts to data, alignment, systems, and inference efficiency

One-Line Synthesis

Transformers evolved from discovering the right architecture to optimizing how efficiently and conditionally that same architecture can think.

Alignment & RLHF

Alignment refers to training language models to follow human intent, preferences, and safety constraints, beyond raw language modeling. Pretrained models optimize next-token prediction, which does not guarantee helpfulness, truthfulness, or harmlessness. As model scale increases, misalignment risks such as hallucinations, harmful content generation, and overconfidence become more pronounced.

Reinforcement Learning from Human Feedback (RLHF) emerged as the dominant paradigm for alignment. A key conceptual shift introduced by RLHF is the separation of capability learning from behavior shaping: models first learn language and reasoning during pretraining, and are then aligned to human expectations during post-training.

The standard RLHF pipeline consists of three stages. First, in Supervised Fine-Tuning (SFT), the model is trained on human-written demonstrations of ideal responses. Second, in Reward Modeling (RM), a separate model learns to score outputs based on human preference rankings. Third, during Policy Optimization, the language model is optimized to maximize the learned reward—typically using Proximal Policy Optimization (PPO)—while remaining close to the pretrained distribution via a KL regularization constraint.

Later systems introduced simplified alignment methods. Notably, Direct Preference Optimization (DPO) removes the need for an explicit reward model and directly optimizes preference comparisons. This approach is used in models such as Mixtral and several later open instruction-tuned systems.
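
A minimal sketch of the DPO loss, assuming the summed per-response log-probabilities under the policy and the frozen reference model have already been computed; beta plays the same role as the KL coefficient in RLHF.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization on a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities log pi(y | x)
    for the chosen or rejected response, under the policy or the frozen
    reference model.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): push the policy toward the preferred response
    # without an explicit reward model or RL loop.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```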

Alignment primarily modifies how models behave rather than what they know. Its effects are most visible in output tone and structure, instruction-following reliability, refusal behavior on unsafe prompts, and truthfulness and verbosity. Core language modeling abilities and reasoning capacity remain largely unchanged by alignment alone.

Importantly, alignment is not emergent from scale. Smaller aligned models, such as InstructGPT at 1.3B parameters, can outperform much larger unaligned models in human preference evaluations. This demonstrates that alignment quality can dominate raw parameter count.

Safety alignment typically includes harmful content reduction, bias mitigation, extensive red-teaming and adversarial evaluation, and post-training policy enforcement. These measures aim to reduce real-world risks without excessively constraining model usefulness.

Alignment introduces trade-offs. Over-alignment can reduce creativity or factual recall, while under-alignment increases the risk of unsafe or misleading outputs. Modern systems balance these tensions using mixed objectives, careful regularization, and iterative human feedback.

Alignment has become a mandatory post-training phase for deployment-ready language models. Current research focuses on improving sample efficiency in preference learning, robustness to reward hacking, and scalable human- and AI-assisted feedback pipelines.

Key insight: Alignment shapes how a model responds, not what it knows.

Principal Mathematical Equations of Transformer-Based Models


1. Token Embedding and Positional Encoding

All models start by mapping tokens to vectors and injecting position information.

$$ \mathbf{X}_0 = \mathbf{E}(\text{tokens}) + \mathbf{P} $$
  • \(\mathbf{E}\): token embedding matrix
  • \(\mathbf{P}\): positional encoding
    • Learned (GPT, BERT)
    • Relative (T5)
    • Rotary (PaLM, LLaMA, Mistral, Gemma)
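
As an illustration of the rotary case, the sketch below applies RoPE in the rotate-half formulation common to open LLaMA-style implementations. The tensor layout and base frequency are assumptions for the example, not any specific model's configuration.

```python
import torch

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embeddings to a query or key tensor.

    x: (batch, seq_len, n_heads, head_dim) with an even head_dim.
    Pairs of channels are rotated by a position-dependent angle, so relative
    offsets are encoded directly in the q.k dot products.
    """
    b, t, h, d = x.shape
    half = d // 2
    pos = torch.arange(t, dtype=x.dtype, device=x.device)                    # (t,)
    inv_freq = base ** (-torch.arange(0, half, dtype=x.dtype, device=x.device) / half)
    angles = torch.einsum("t,f->tf", pos, inv_freq)                          # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```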

2. Self-Attention (Core Transformer Equation)

The fundamental computation shared by all Transformer-based models.

$$ \text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$ with: $$ Q = XW_Q,\quad K = XW_K,\quad V = XW_V $$
  • Causal mask applied for GPT-style models
  • Fully bidirectional attention for BERT and T5 encoders
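
This equation translates almost directly into code. The sketch below is a single-head version with an optional causal mask; it is illustrative rather than optimized (production systems rely on fused kernels such as FlashAttention).

```python
import math
import torch

def attention(q, k, v, causal=False):
    """Scaled dot-product attention for a single head.

    q, k, v: (batch, seq_len, d_k). With causal=True, a lower-triangular mask
    lets each position attend only to itself and the past (GPT-style decoders);
    causal=False gives the bidirectional form used by BERT and T5 encoders.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (batch, t, t)
    if causal:
        t = scores.size(-1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))     # block attention to the future
    return torch.softmax(scores, dim=-1) @ v
```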

3. Multi-Head Attention

Parallel attention subspaces improve expressivity.

$$ \text{MHA}(X) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W_O $$ $$ \text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i) $$

4. Efficient Attention Variants

Multi-Query Attention (PaLM, Gemma):

$$ K,V \text{ shared across heads},\quad Q \text{ remains per-head} $$

Grouped-Query Attention (Mistral):

$$ h_Q > h_{KV}, \quad \text{KV shared by head groups} $$

Sliding Window Attention (Mistral):

$$ \text{Attention}_t = \text{Attention}(Q_t, K_{t-w:t}, V_{t-w:t}) $$
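
A compact sketch of grouped-query attention, assuming queries, keys, and values are already split into a (batch, heads, seq, dim) layout. Setting the number of KV heads to one recovers multi-query attention; restricting the mask to a band of width w would give the sliding-window variant.

```python
import math
import torch

def grouped_query_attention(q, k, v, causal=True):
    """Grouped-query attention: more query heads than key/value heads.

    q: (batch, n_q_heads, t, d);  k, v: (batch, n_kv_heads, t, d),
    where n_q_heads is a multiple of n_kv_heads. Each KV head is shared by a
    group of query heads (n_kv_heads = 1 is multi-query attention).
    """
    group = q.size(1) // k.size(1)
    k = k.repeat_interleave(group, dim=1)    # broadcast each KV head across its query group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        t = scores.size(-1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```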

5. Feed-Forward Network (FFN)

Applies non-linear transformation independently per token.

Standard FFN (GPT, BERT):

$$ \text{FFN}(x) = W_2\,\sigma(W_1 x) $$

SwiGLU / GeGLU (PaLM, LLaMA, Gemma):

$$ \text{FFN}(x) = W_2\left(\sigma(W_1 x) \odot W_3 x\right) $$
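
A minimal module-level sketch of the gated FFN above. The hidden width is left as a free parameter, and swapping SiLU for GELU yields the GeGLU form used by Gemma.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward network: W2 (silu(W1 x) * W3 x)."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)   # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)   # output projection

    def forward(self, x):
        # Elementwise gating: the nonlinearity modulates a linear branch.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```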

6. Residual Connections and Normalization

Residual pathways stabilize deep Transformer training.

$$ x_{l+1} = x_l + \mathcal{F}(\text{Norm}(x_l)) $$
  • LayerNorm (GPT, BERT)
  • RMSNorm (LLaMA-family, Gemma)
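
The sketch below pairs an RMSNorm implementation with the pre-norm residual pattern from the equation; it is illustrative and omits details such as mixed-precision handling.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the activations
    (no mean subtraction, no bias), as used by the LLaMA family and Gemma."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

def prenorm_residual(x, sublayer, norm):
    """x_{l+1} = x_l + F(Norm(x_l)): the pre-norm residual pattern above."""
    return x + sublayer(norm(x))
```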

7. Decoder-Only Autoregressive Objective (GPT-family)

All GPT-like models optimize next-token prediction.

$$ \mathcal{L}_{\text{LM}} = -\sum_{t} \log p\!\left(x_t \mid x_{<t}\right) $$
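
In practice this objective is a shifted cross-entropy between the model's logits and the input sequence, as in the sketch below (tensor shapes are assumptions for the example).

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Autoregressive cross-entropy: predict token t+1 from positions <= t.

    logits: (batch, seq_len, vocab);  tokens: (batch, seq_len) integer ids.
    """
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions at positions 0..T-2
        tokens[:, 1:].reshape(-1),                    # targets are the next tokens
    )
```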


8. Masked Language Modeling (BERT)

Bidirectional pretraining via token corruption.

$$ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p(x_i \mid x_{\setminus \mathcal{M}}) $$

9. Text-to-Text Denoising Objective (T5)

Span corruption with sequence reconstruction.

$$ \mathcal{L}_{\text{T5}} = -\log p(\text{target span} \mid \text{corrupted input}) $$

10. Replaced Token Detection (ELECTRA)

Discriminative pretraining objective.

$$ \mathcal{L}_{\text{RTD}} = -\sum_i \left[ y_i \log D(x_i) + (1-y_i)\log(1-D(x_i)) \right] $$

11. Mixture-of-Experts Routing (Mixtral)

Sparse conditional computation.

$$ y = \sum_{i \in \text{Top-}k} p_i(x)\,\text{Expert}_i(x) $$ where: $$ p_i(x) = \text{softmax}(W_r x) $$

Only a subset of experts is activated per token.
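
The sketch below implements the routing equation with a plain per-expert loop for readability. The expert architecture, expert count, and k are illustrative defaults; production systems replace the loop with batched, expert-parallel kernels.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse MoE feed-forward layer with top-k routing (k=2 in Mixtral).

    Each token is sent only to its k highest-scoring experts, and the expert
    outputs are combined with the renormalized router probabilities.
    """
    def __init__(self, d_model, d_hidden, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                         # x: (n_tokens, d_model)
        logits = self.router(x)                   # (n_tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1)  # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                # loop-based for clarity, not efficiency
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```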


12. Instruction Alignment via RLHF (InstructGPT, GPT-4, LLaMA-Chat)

Optimization against a learned reward model.

$$ \max_\theta \;\mathbb{E}_{x \sim \pi_\theta}\left[ r(x) - \beta\,\text{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right] $$
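
A sketch of how this objective is typically estimated in practice: the KL term is approximated from the log-probability gap between the policy and the reference model on sampled responses, and the resulting regularized reward is what a policy-gradient method such as PPO maximizes. Shapes and the beta value are assumptions for the example.

```python
def kl_regularized_reward(reward, policy_logps, ref_logps, beta=0.02):
    """Sequence-level RLHF objective term: r(x) - beta * KL(pi_theta || pi_ref).

    reward: (batch,) scalar reward-model scores for sampled responses.
    policy_logps / ref_logps: per-token log-probs of those responses under the
    policy and the frozen reference model, shape (batch, response_len).
    The KL penalty keeps the policy close to the pretrained distribution.
    """
    kl = (policy_logps - ref_logps).sum(dim=-1)   # Monte-Carlo KL estimate per sequence
    return reward - beta * kl
```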

Unifying Mathematical Insight

$$ \boxed{ \text{All modern LLMs are variations of:} \quad \text{Self-Attention} + \text{Nonlinear FFN} + \text{Residual Learning} } $$

Differences arise from:

  • Masking strategy
  • Attention efficiency
  • Conditional computation
  • Training objectives, not core mathematics

One-line Synthesis

Despite dramatic capability gains, the mathematical core of LLMs has remained remarkably stable since 2017—innovation lies in efficiency, routing, and optimization, not new equations.

Training Paradigm Evolution of Transformer-Based Language Models


1. Unsupervised Pretraining → Supervised Fine-Tuning (2018)

Models: GPT, BERT

  • Large-scale unsupervised pretraining on raw text
  • Task-specific supervised fine-tuning
  • Objective differs by architecture:
    • GPT: autoregressive language modeling
    • BERT: masked language modeling + next sentence prediction
  • Paradigm: learn general language representations, then adapt per task

2. Scaling as the Primary Driver (2019–2020)

Models: GPT-2, GPT-3

  • Same pretraining objective, massively scaled
  • Minimal or no architectural changes
  • Emergence of zero-shot, one-shot, and few-shot learning
  • Fine-tuning increasingly replaced by prompting
  • Paradigm: capabilities emerge from scale

3. Unified Multi-Task Pretraining (2019–2020)

Models: T5

  • All tasks reformulated into a text-to-text format
  • Denoising pretraining on clean, large-scale corpora
  • Joint multi-task fine-tuning across benchmarks
  • Paradigm: one model, one objective, all tasks

4. Compute-Efficient Pretraining (2020)

Models: ELECTRA

  • Generator–discriminator joint training setup
  • Discriminator learns from every token
  • Higher sample efficiency under fixed compute budgets
  • Downstream fine-tuning uses the discriminator only
  • Paradigm: maximize learning signal per FLOP

5. Few-Shot-First Evaluation (2020–2022)

Models: GPT-3, PaLM

  • No task-specific fine-tuning for benchmark evaluation
  • Strong emphasis on in-context learning
  • Benchmarks evaluated via natural language prompts
  • Paradigm: generalization via context, not gradients

6. Infrastructure-Driven Scaling (2022)

Models: PaLM

  • Massive dense models trained using the Pathways system
  • Predictable scaling across thousands of accelerators
  • Training stability becomes a first-class concern
  • Paradigm: systems and scale unlock new capabilities

7. Data-Optimal Training (2022–2023)

Models: LLaMA, LLaMA 2

  • More tokens per parameter following compute-optimal scaling laws
  • Long training runs on high-quality, filtered datasets
  • Open and reproducible data mixtures emphasized
  • Paradigm: data efficiency beats parameter inflation

8. Alignment as a Training Phase (2022–2023)

Models: InstructGPT, GPT-4, LLaMA 2-Chat

  • Multi-stage post-training pipeline:
    • Supervised fine-tuning with human demonstrations
    • Preference modeling
    • RLHF or DPO optimization
  • Alignment separated from capability learning
  • Paradigm: helpfulness and safety are trained, not emergent

9. Inference-Efficient Training Awareness (2023)

Models: Mistral 7B

  • Training supports inference-efficient mechanisms
  • Sliding window attention
  • Grouped-query attention
  • Architecture designed with deployment constraints in mind
  • Paradigm: train models that deploy efficiently

10. Conditional Computation (2024)

Models: Mixtral

  • Sparse Mixture-of-Experts pretraining
  • Top-k expert routing per token
  • Model capacity scales without proportional compute cost
  • Paradigm: train large capacity, activate selectively

11. Responsible Open Training (2024)

Models: Gemma

  • Gemini-derived training recipes
  • Extensive memorization, bias, and safety evaluation
  • Instruction-tuned checkpoints released alongside base models
  • Paradigm: capability and responsibility at release time

12. Data-Centric Small-Model Training (2024)

Models: SmolLM2

  • Multi-stage curriculum training
  • Dataset replacement and rebalancing mid-training
  • Late-stage upsampling of reasoning-heavy data
  • Paradigm: data scheduling dominates architecture for small models

Global Training Paradigm Shift

  • 2018: Pretrain then fine-tune
  • 2019–2020: Scale everything
  • 2020–2021: Prompt instead of fine-tune
  • 2022: Infrastructure and alignment
  • 2023: Data efficiency and open models
  • 2024: Conditional compute and data-centric training

One-line Synthesis

LLM training evolved from “learn language” to “scale it,” then to “align it,” and finally to “train it efficiently, conditionally, and responsibly.”

Evaluation & Benchmarks

Evaluation of Transformer-based language models has evolved from narrow, task-specific supervised benchmarks to broad, capability-oriented assessments. Early evaluation focused on accuracy under fine-tuning, measuring isolated skills rather than generalization or reasoning ability.

Early-era benchmarks (2018–2019) emphasized sentence-level understanding and reading comprehension. Datasets such as GLUE and SuperGLUE measured classification and inference accuracy, while SQuAD evaluated extractive question answering. These benchmarks primarily reflected fine-tuned performance rather than intrinsic model capability.

The scaling era (2020) marked a fundamental shift in evaluation methodology. With GPT-3, models were assessed in zero-shot, one-shot, and few-shot settings using natural-language prompts. Existing benchmarks were reused, but the evaluation protocol changed from gradient-based adaptation to in-context learning, reframing benchmarks as measures of generalization from context alone.

Reasoning and knowledge benchmarks (2021–2022) emerged to probe capabilities beyond surface-level language understanding. BIG-bench evaluated a wide range of tasks designed to reveal emergent abilities at scale. Commonsense and reasoning benchmarks such as ARC, HellaSwag, and Winogrande, along with mathematical datasets like GSM8K and MATH, highlighted discontinuous performance gains as models grew larger.

Code and program synthesis benchmarks (2021–2023) became central as models began generating executable programs. HumanEval and MBPP evaluated functional correctness using unit tests, with pass@k metrics capturing solution diversity. These benchmarks played a key role in evaluating models such as PaLM, GPT-4, Mistral, and Mixtral.

Multilingual and robustness evaluation assessed cross-lingual generalization and data balance. Benchmarks such as XNLI, TyDi QA, and multilingual variants of MMLU evaluated performance without translation, revealing significant language imbalance and varying generalization across linguistic domains.

Instruction-following and alignment benchmarks (2022–2024) shifted emphasis toward behavior and human preference. Human evaluations became a primary signal for instruction quality. TruthfulQA measured resistance to hallucinations, IFEval evaluated instruction compliance, and safety benchmarks assessed toxicity, bias, and refusal behavior.

Holistic evaluation (2023–2024) expanded beyond traditional NLP tasks. MMLU became the standard multi-domain, few-shot academic benchmark. Professional and academic exams, including bar and medical exams, were used to assess human-level competence. Long-context retrieval tests evaluated models such as Mistral and Mixtral under extended context lengths.

Despite their usefulness, benchmarks exhibit key limitations, including susceptibility to training data contamination, overfitting to static test sets, and weak correlation with real-world reliability. As a result, no single benchmark can fully characterize model capability.

The current consensus is that modern evaluation must rely on benchmark suites combined with human judgment. Emphasis is increasingly shifting toward dynamic, adversarial, and real-world evaluations that better reflect deployment conditions.

Core insight: Evaluation has evolved from measuring task accuracy to probing general intelligence, reasoning, and alignment under minimal supervision.