LLM 101: Efficiency, Compression & Distillation — 2025

Antreas Antoniou · AI Researcher · Engineer · Founder, Axiotic AI

MSc/PhD Lecture — 2025/2026

"How do we build intelligence that runs on fruit, not a power station?"

Who Is Your Guide Today?

Antreas Antoniou — the short version:

  • 🎓 PhD in Machine Learning & Meta-Learning, University of Edinburgh — Thesis: learning to learn from few examples
  • 🔬 Research Scientist, Google · Applied Scientist, Amazon
  • 🏛️ Principal Scientist / Founder, Axiotic AI: smarter learning, not just bigger models

Why this lecture is personal:

I didn't just study this topic. I founded a company because I don't believe intelligence needs a skyscraper's worth of computers and a national power budget to run. We're not anti-scale — we just think there's a smarter path. Language is a shadow of the world. LLMs are missing the ground truth. We're building towards that.

The question that drives everything:

How do we build intelligence that runs on fruit, not on a power station?

The answer requires understanding what we're building — and then learning to make it radically more efficient.

The Banana Slide 🍌

Your brain weighs ~1.4 kg. Runs on ~20 watts — roughly 3 bananas per day.

A single NVIDIA DGX H200 — one AI training node — draws 10,200 watts, roughly 2,000 bananas per day. A GPT-4 training run consumed enough energy to power a small city for a week.

And yet — your brain learns to recognise a cat from 5–10 examples. GPT-4 needed trillions of tokens. Your brain generalises, adapts, reasons about novel situations — and does it all on fruit.

| Claim | Number | Source |
|---|---|---|
| Brain power consumption | 20 W | Raichle & Gusnard, PNAS 2002 |
| 3 bananas energy | 315 kcal | USDA FoodData Central |
| Brain 18h energy need | 310 kcal | 20 W × 18 h = 360 Wh ≈ 310 kcal |
| Banana–brain match | 98.3% | 310 / 315 kcal |
| DGX H200 system power | 10.2 kW | NVIDIA DGX H200 spec sheet |
| DGX bananas/day | ~2,000 | 10.2 kW × 24 h ÷ 0.122 kWh/banana |
| Efficiency ratio | 510:1 | 10,200 W ÷ 20 W |
| MacBook M4 under ML load | ~25–30 W | Apple M4 chip TDP estimates |

🔑 The gap is not computational power. The gap is algorithmic intelligence. We're burning 510× more power per node and still can't match the brain's flexibility.
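
For anyone who wants to check the banana arithmetic, here is a minimal sketch that recomputes it from its inputs. The 105 kcal per banana figure is an assumption (it is what makes 3 bananas equal 315 kcal), and the variable names are illustrative.

```python
# Energy bookkeeping for the banana comparison.
KCAL_PER_WH = 0.8604   # 1 Wh = 3600 J and 1 kcal = 4184 J
BANANA_KCAL = 105      # assumed: one medium banana, so 3 bananas = 315 kcal

brain_wh = 20 * 18                               # 20 W over an 18-hour day
brain_kcal = brain_wh * KCAL_PER_WH              # roughly 310 kcal
banana_kwh = BANANA_KCAL / KCAL_PER_WH / 1000    # roughly 0.122 kWh per banana
dgx_bananas_per_day = 10.2 * 24 / banana_kwh     # roughly 2,000
power_ratio = 10_200 / 20                        # 510

print(round(brain_kcal), round(banana_kwh, 3), round(dgx_bananas_per_day), power_ratio)
```

Note that the comparison is between power draws (watts), so the 510:1 ratio says nothing yet about how much useful work each system does per joule.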

What You'll Take Home Today:

  • 🧠 You'll understand what's actually inside a modern LLM — not hand-wavy analogies, the real architecture
  • 🗜️ You'll know how to make models 10–100× smaller without destroying them
  • 💻 You'll run a state-of-the-art reasoning model on your own laptop — tonight
1. LLM 101

The Transformer, Training & Fine-Tuning

Lecture Roadmap

| Section | Focus | Slides |
|---|---|---|
| 1. LLM 101 🧱 | Transformer architecture, training, fine-tuning | 4–21 |
| 2. Compression & Distillation 🗜️ | Pruning, quantization, distillation, efficiency | 22–40 |
| 3. Smarter, Not Bigger 🧠 | Local agents, intelligence per watt, the future | 41–49 |
| 4. Resources & Big Picture 🗺️ | Tools, papers, what to do next | 50–55 |

At every stage I'll tell you where we are in the story. You'll never be lost.

The Transformer — Not That Kind

When someone says "transformer" in AI...

❌ Surely not this kind?

Or this kind?

❌ Or even this kind?

🔑 They mean a specific neural network architecture invented at Google in 2017 — Attention Is All You Need (Vaswani et al., arXiv:1706.03762)

The Transformer Architecture

The Transformer Block: A Layered Sandwich

One block = Attention + FFN + Normalization + Skip Connections, repeated N times.

The LEGO analogy: A few types of bricks — Token Embeddings, Positional Embeddings, Self-Attention, Feed-Forward, LayerNorm + Skip. Stack them. Same recipe, over and over. Simple rules, complex emergent results.
Transformer architecture diagram

The 2025 Transformer Recipe

Every major model (Llama 3, Gemma 2, Mistral, DeepSeek, Qwen) converged on:

① Tokenize — BPE or SentencePiece (vocabulary ~32K–128K tokens)

② Each transformer block:

  • RMSNorm (not LayerNorm) — pre-norm: x / sqrt(mean(x²) + ε) × γ — ~15% fewer ops (Zhang & Sennrich, 2019)
  • Grouped Query Attention (GQA) — reduces KV cache by sharing K,V heads across Q groups (Ainslie et al., 2023, arXiv:2305.13245)
  • SwiGLU feed-forward — FFN(x) = (Swish(xW₁) ⊙ xV) W₂, where ⊙ is the elementwise product — consistently outperforms ReLU/GELU (Shazeer, 2020)
  • Residual connections (unchanged from 2017 — still essential)

③ Autoregressive next-token prediction

Common misconception: "Transformers use LayerNorm and ReLU." No. Not since 2023.
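
The two formulas in the recipe are short enough to write out. This is a minimal pure-Python sketch for a single token vector; the helper names (`rms_norm`, `swiglu_ffn`, `matvec`) and the toy weights are illustrative, not taken from any particular model.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """x / sqrt(mean(x^2) + eps) * gamma, applied to one token vector."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / scale for v, g in zip(x, gamma)]

def swish(v):
    return v / (1.0 + math.exp(-v))   # v * sigmoid(v)

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, V, W2):
    """(Swish(x W1) * (x V)) W2, with * the elementwise product."""
    gate = [swish(v) for v in matvec(x, W1)]
    up = matvec(x, V)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, W2)

x = [1.0, -2.0, 3.0, 0.5]
normed = rms_norm(x, gamma=[1.0] * 4)          # pre-norm: normalise before the sublayer
W1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, 0.8]]
V = [[0.2, 0.1], [-0.1, 0.3], [0.4, -0.2], [0.0, 0.5]]
W2 = [[1.0, 0.0, -1.0, 0.5], [0.5, 1.0, 0.0, -0.5]]
out = swiglu_ffn(normed, W1, V, W2)
print(out)
```

Unlike LayerNorm, RMSNorm subtracts no mean and adds no bias, which is where the saved operations come from.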

Self-Attention — The Core Mechanism

The exam hall analogy: For each question, peek at the classmates worth copying — history nerd for history, maths genius for maths.

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I have to offer?"
  • Value (V): "Here's my actual content"

Attention(Q,K,V) = softmax(QKᵀ/√d_k)V

Why √d_k? Without it, dot products grow with dimension, pushing softmax into saturation — killing gradients.
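
The formula above fits in a few lines. The following is a minimal single-head sketch on plain Python lists, with no masking or batching; the toy Q, K, V values are made up for illustration.

```python
import math

def softmax(row):
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head; returns outputs and attention weights."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]                      # two queries
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # three keys
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]          # three values
out, weights = attention(Q, K, V)
print(weights[0])
```

Each output row is a weighted average of the value rows, with weights decided by how well the query matches each key.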

Why this matters: This is why in-context learning works — the attention mechanism IS the learning mechanism at inference time. Every time you put examples in your prompt, attention is literally learning from them on the fly.

Multi-Head Attention (MHA): h parallel heads with different projections. Each learns different relationships — syntax, coreference, semantics.

Self-attention diagram

Attention Variants — MHA → GQA → MQA

The KV cache problem: During generation, we cache K and V for all previous tokens. For a 70B-class model with full MHA (80 layers, 64 heads, 128 dim/head, FP16) at 8K context: ~20 GB of KV cache per sequence. This is the serving bottleneck.

| Variant | Year | K,V Heads | KV Cache | Used In |
|---|---|---|---|---|
| MHA (Multi-Head) | 2017 | h | Baseline | BERT, GPT-2/3 |
| MQA (Multi-Query) | 2019 | 1 | ÷ h | PaLM, Falcon |
| GQA (Grouped-Query) | 2023 | g (1<g<h) | ÷ (h/g) | Llama 2/3, Mistral, Gemma, DeepSeek |

GQA is the new standard. Llama 2 70B: 64 Q heads, 8 KV heads → 8× KV cache reduction. MHA is effectively deprecated for large-scale deployment.
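
A back-of-the-envelope sketch of the KV cache arithmetic. It assumes a Llama-2-70B-like shape (80 layers, head dimension 128, FP16 values, 8K context); the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2: each layer stores one K tensor and one V tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"MHA: {mha / GIB:.1f} GiB, GQA: {gqa / GIB:.1f} GiB, reduction: {mha // gqa}x")
```

The cache scales linearly with the number of KV heads, so going from 64 to 8 KV heads gives the 8× reduction directly.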

FlashAttention (Dao et al., 2022, arXiv:2205.14135): Rewrites attention to work in GPU SRAM tiles. 2-4× speedup, linear memory, exact output. Built into PyTorch ≥2.0 via torch.nn.functional.scaled_dot_product_attention.

Positional Embeddings — The Evolution

Transformers are permutation-equivariant by default — "Dog bites man" = "Man bites dog" without positional info.

Think of it as seat numbers in a lecture hall.

| Method | Year | Status |
|---|---|---|
| Sinusoidal | 2017 | Original paper |
| Learned absolute | 2018 | Fixed max length |
| RoPE | 2021 | De facto standard |
| ALiBi | 2022 | Simple, no learned params |
| YaRN | 2023 | 128K+ from 4K training |

RoPE: Applies rotation matrices to Q and K. The dot product naturally encodes relative distance — relative position emerges from geometry, not a lookup table.
Positional encoding diagram
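
The relative-position property of RoPE can be checked numerically. This is a minimal sketch assuming the common formulation (consecutive pairs rotated with frequencies base^(-2i/d), base 10000); the vectors are arbitrary toy values.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)       # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]
near = dot(rope(q, 5), rope(k, 2))          # offset 3
far = dot(rope(q, 105), rope(k, 102))       # same offset 3, both shifted by 100
other = dot(rope(q, 5), rope(k, 4))         # offset 1
print(near, far, other)
```

The first two scores agree because only the offset between query and key positions survives in the dot product; changing the offset changes the score.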

Skip Connections & Why Transformers Work

Skip connections — arguably the most important idea in deep learning:

output = Layer(x) + x — the identity shortcut

  • Without: chaotic, jagged loss landscape with sharp local minima
  • With: smooth, nearly convex — easy to optimise (Li et al., 2018, arXiv:1712.09913)

What makes Transformers uniquely powerful:

  • Associative/relational pair-wise representation learning — more expressive than ConvNets/MLPs
  • Areas and shapes of attention are learnable — not fixed like convolution kernels
  • Data-conditional computation — functions applied per token are data-conditional, not fixed after training
🔥 Hot Take: "The Transformer is fundamentally a series of meta-learning networks… Learning to learn was underneath it all, all along! And they say meta-learning is not popular nowadays"
— Antreas Antoniou, PhD in Meta-Learning, 2020
Skip connections / loss landscape

Pretraining — Autoregressive Language Modelling

Autoregressive training diagram

Analogy: Give a student millions of exam papers with the last answer always blanked. Their job: predict the next word, billions of times.

P(x₁,...,xₙ) = ∏ᵢ P(xᵢ | x₁,...,xᵢ₋₁)

  • Loss: Cross-entropy between predicted distribution and actual next token
  • Masked self-attention: Each position only attends to positions ≤ itself
  • Training is parallel: All positions computed simultaneously (teacher forcing)
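
The three bullets above can be sketched in a few lines. This toy version takes per-position logits as given (no model), scores position i against token i+1, and builds the causal mask; all names and values are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy where the logits at position i are scored against token i+1."""
    losses = []
    for i in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[i])
        losses.append(-log_probs[token_ids[i + 1]])
    return sum(losses) / len(losses)

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = [2, 0, 1, 3]                        # toy sequence over a 4-token vocabulary
uniform = [[0.0] * 4 for _ in tokens]        # a model that knows nothing
loss = next_token_loss(uniform, tokens)      # equals log(4) for a uniform prediction
print(loss, causal_mask(3))
```

Because every position has its own target, one forward pass over the sequence yields a training signal at every token, which is what makes teacher-forced training parallel.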

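The Chinchilla insight (Hoffmann et al., 2022): GPT-3 was dramatically undertrained. It saw ~300B tokens for 175B parameters (under 2 tokens per parameter), whereas compute-optimal training needs roughly 20 tokens per parameter. Modern practice goes far beyond even that to buy cheaper inference: Llama 3 8B was trained on 15T tokens (~1,900 tokens per parameter).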

🔑 Next-token prediction forces the model to build a rich internal world model — sufficient to learn grammar, facts, reasoning, coding, and common sense.

Pretraining — Diffusion Language Models

The challenger to autoregressive dominance:

| | Autoregressive | Diffusion |
|---|---|---|
| How | Tokens one by one, L→R | Denoise all tokens simultaneously |
| Speed | Slow (sequential) | Fast (parallel) |
| Quality | SotA (2025) | Catching up fast |
| Examples | GPT, Llama, Claude | Mercury (Inception Labs) |

Mercury (Inception Labs, 2025; see also LLaDA, Nie et al., 2025, arXiv:2502.09992):

  • First diffusion LM competitive with AR models on standard benchmarks
  • 3× faster generation — parallel vs sequential
  • Strong for structured outputs where global coherence matters
This is the first credible alternative to autoregressive generation in 8 years. Pay attention.

Architecture Families — Dense → MoE → SSM → Hybrid

| Family | Key Idea | Examples | Tradeoff |
|---|---|---|---|
| Dense Transformer | Full attention, all params active | GPT-3, Llama 3, Gemma 2 | Maximum quality, maximum cost |
| MoE | Route tokens to specialised sub-networks | Mixtral, DeepSeek-V3 | Capacity ≫ compute cost |
| SSM (Mamba) | Linear-time recurrence, no attention | Mamba, S4, RWKV | Fast inference, less proven at scale |
| Hybrid | Attention + recurrence + memory | Jamba, Griffin | Best of both worlds? Under research |

DeepSeek-V3 (Dec 2024): 671B total params, only 37B active per token. Reported cost of the final training run: ~$5.57M in GPU time. Competitive with GPT-4o.

Mamba (Gu & Dao, 2023, arXiv:2312.00752): O(N) time and memory. Strength: long-context streaming. Weakness: in-context learning. SSMs are not a Transformer replacement yet — they are a complement.

Reasoning Models — Test-Time Compute as New Scaling Axis

The breakthrough idea: Instead of only scaling training compute, scale inference compute.

OpenAI o1/o3 (2024-2025)

  • Extended chain-of-thought reasoning before answering, trained with RL
  • Can "think harder" on difficult problems by generating more tokens
  • o3: 87.5% on ARC-AGI (high-compute setting), ~96% on AIME math competition

DeepSeek-R1 (Jan 2025, arXiv:2501.12948)

  • Open-weight 671B MoE reasoning model
  • R1-Zero: pure RL (GRPO) on the base model produces emergent CoT reasoning without supervised CoT data; R1 adds a small cold-start SFT set before RL
  • Matches o1-level performance on math and coding — open-weight + recipe published

Pause. In December 2024, o3 reached scores on ARC-AGI comparable to typical human performance. This happened in your lifetime. What does that mean?

The test-time scaling law: More reasoning tokens → better answers, following a smooth scaling curve. A 7B model thinking for 2000 tokens can outperform a 70B model thinking for 100.

🔑 You don't need a bigger model — you need a model that thinks longer.

Fine-Tuning — SFT → RLHF → DPO

The modern fine-tuning pipeline (2025):

Base model (pretrained) → [SFT] → [Preference Alignment] → Aligned model (deployed)

SFT (Supervised Fine-Tuning)

Train on (instruction, response) pairs. Standard cross-entropy. Teaches format/style. "What to say."

RLHF (Ouyang et al., 2022)

Train a reward model on human preferences. Optimise policy with PPO. Complex: 4 models in memory. "What's good."

DPO (Rafailov et al., 2023, arXiv:2305.18290)

Closed-form relationship between optimal RLHF policy and reward function. Eliminates reward model entirely. Simpler, more stable. Now the default for open-source alignment.
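
To make "eliminates the reward model" concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the summed log-probabilities of each response under the policy and the frozen reference model are already computed; the numbers and beta = 0.1 are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))      # equals -log(sigmoid(margin))

at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)     # policy == reference, loss = log 2
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)    # policy now favours the chosen answer
print(at_init, improved)
```

The only models involved are the policy and the reference: the implicit reward is the log-ratio between them, so no separate reward model or PPO loop is needed.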

Variants: ORPO (Hong et al., 2024) — SFT + alignment in one step. SimPO (Meng et al., 2024) — reference-model-free. KTO (Ethayarajh et al., 2024) — unpaired preferences.

Think: SFT teaches "what to say." RLHF/DPO teaches "what's good" — capturing helpfulness, harmlessness, honesty.

PEFT vs Full Fine-Tuning

LoRA diagram

Full Fine-Tuning (FFT):

  • ✅ Updates all model parameters — best quality for deep domain shifts
  • ❌ For 70B in bf16: ~140 GB weights + ~140 GB gradients + AdamW optimizer states (~280 GB with bf16 moments, ~840 GB with fp32 master weights and moments)

LoRA (Hu et al., 2021, arXiv:2106.09685):

  • Freeze pretrained weights. Add a low-rank update: W' = W + BA, with B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and rank r ≪ min(d, k)
  • Trainable params ≈ 0.1–1% of total
  • Analogy: Instead of repainting a building, add a thin film to each window
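
A small sketch of why LoRA is cheap. It counts trainable parameters for one weight matrix and shows the merged weight W' = W + BA; the 4096×4096 layer size and r = 8 are assumptions for illustration, and the optional `scale` argument stands in for the alpha/r factor used in practice.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_param_counts(d_out, d_in, r):
    full = d_out * d_in              # parameters in the frozen matrix W
    lora = r * (d_out + d_in)        # parameters in B (d_out x r) plus A (r x d_in)
    return full, lora

def merged_weight(W, B, A, scale=1.0):
    """W' = W + scale * B A, the weight used at inference after merging the adapter."""
    BA = matmul(B, A)
    return [[w + scale * ba for w, ba in zip(w_row, ba_row)]
            for w_row, ba_row in zip(W, BA)]

full, lora = lora_param_counts(4096, 4096, r=8)
print(f"trainable fraction for this layer: {lora / full:.4%}")

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.25]]
B_init = [[0.0], [0.0], [0.0]]       # B starts at zero, so W' == W before training
B_trained = [[1.0], [0.0], [0.0]]    # hypothetical values after some training
print(merged_weight(W, B_init, A), merged_weight(W, B_trained, A))
```

Because the update can be merged back into W, a trained adapter adds no latency at inference time.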

QLoRA (Dettmers et al., 2023, arXiv:2305.14314):

  • Base model in 4-bit NormalFloat + LoRA adapters in bf16
  • Fine-tune 65B on a single 48GB GPU — democratised LLM fine-tuning
Practical guidance: Have H100s? → FFT wins. Single GPU? → QLoRA is remarkable. Many task variants? → LoRA adapters.

Antreas's take: "FFT often outperforms PEFT in my experience, especially when large compute is available." Don't default to LoRA because it's fashionable — default to it because your hardware demands it.

On Training the LLM Beast

There are three main directions for training LLMs efficiently:

1. Single GPU Training

  • Best for models that fit comfortably in one GPU's memory
  • Standard DataLoader + optimizer loop, no parallelism overhead
  • Limited to models ~7B on 80GB A100; use quantization to push further

2. Multi-GPU Data-Parallel Training

  • Each GPU holds a full copy of the model
  • Different data batches processed in parallel, gradients averaged
  • Does not work for models requiring more than 1 GPU's memory for forward + backward pass

3. FSDP — Fully Sharded Data Parallelism

  • Shards model parameters, gradients, and optimizer states across GPUs
  • Enables training models that are too large to fit on any single GPU
  • Each GPU holds only its shard; parameters are gathered on-demand during compute
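
A rough sketch of the memory argument for sharding. It assumes mixed-precision AdamW at 16 bytes per parameter (bf16 weights and gradients, fp32 master weights and two moments) and ignores activations and the temporary buffers FSDP needs when it gathers parameters; the function names are illustrative.

```python
def training_state_gb(n_params_billion, bytes_weights=2, bytes_grads=2, bytes_optim=12):
    """Weights + gradients + AdamW state in GB (activations excluded)."""
    return n_params_billion * (bytes_weights + bytes_grads + bytes_optim)

def per_gpu_gb(n_params_billion, n_gpus, sharded):
    total = training_state_gb(n_params_billion)
    # Data parallel replicates everything; full sharding divides it across GPUs.
    return total / n_gpus if sharded else total

for name, sharded in [("Data parallel", False), ("FSDP", True)]:
    print(f"{name}: {per_gpu_gb(70, n_gpus=64, sharded=sharded):.1f} GB per GPU")
```

Under these assumptions a 70B model needs over a terabyte of training state, which no single GPU holds, but under 20 GB per GPU once it is sharded across 64 devices.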

🆕 Predicting Training Instability (2026)

  • Residual Koopman Spectral Profiling (arXiv:2602.22988): Treats the transformer forward pass as a dynamical system and applies Koopman operator theory to the residual stream
  • Can predict training instability ~100 steps before a loss spike happens by reading spectral precursors
  • Could eliminate catastrophic training failures — one of the most expensive problems in large-scale training
🔑 FSDP is the key technique for fine-tuning frontier-scale models without an ocean of memory — it's what makes Llama 3 405B fine-tuning practical. And with Koopman profiling, we may soon be able to prevent the training crashes that waste millions in compute.

On FSDP — Fully Sharded Data Parallelism

What is FSDP? Memory-efficient distributed training for large models.

  • Shards parameters, gradients, and optimizer states across all GPUs
  • Reduces per-GPU memory — enables larger models and larger batch sizes on the same hardware

FSDP Modes:

  • Full Parameter Sharding: Weights, gradients, and optimizer states all sharded
  • Mixed Precision (bf16/fp16): Reduces memory while maintaining stability
  • Activation Checkpointing: Saves memory by recomputing activations during backprop
  • Auto Wrapping: Automatically applies sharding to transformer blocks
Note: Optimal hyperparameters depend on hardware, model size, task, and data. Tune per workload.
FSDP sharding diagram

Transformer Timeline — 2017 to 2025

| Year | Milestone | Scale |
|---|---|---|
| 2017 | "Attention Is All You Need" at Google | 65M params |
| 2018 | GPT-1 (OpenAI adopts Transformer) | 117M |
| 2018 | BERT (bidirectional) — dominates NLP overnight | 340M |
| 2019 | GPT-2 — first "scary good" generation | 1.5B |
| 2020 | GPT-3 — era of scaling begins | 175B |
| 2022 | Chinchilla — reveals models were undertrained | 70B (better) |
| 2022 | ChatGPT — the public awakening | |
| 2023 | GPT-4, Llama, open-source explosion | ~1.8T (MoE, unconfirmed) |
| 2024 | Reasoning models (o1), Llama 3.1, MoE | 405B dense |
| 2025 | DeepSeek-V3/R1, compression revolution | 671B MoE (37B active) |

The Transformer has dominated every modality, domain, and task for 8 years. No architecture in ML history has had this kind of run.

2 — Model Efficiency, Compression & Distillation

Model efficiency section visual

Making Models Smaller, Faster, and Smarter

Now that we understand the beast — let's learn to tame it. The tools exist. They're elegant. And some of them are genuinely surprising.

The transition from "just make it bigger" to "make it smarter" is the central challenge of AI in 2025.

  • Section structure: Efficiency taxonomy → Pruning → Quantization → Distillation → Inference efficiency
  • Each technique reduces model cost while preserving capability
  • They compose — a pruned, quantized, distilled model can be 100× smaller with <5% quality loss

Three Kinds of Efficiency

| 🖥️ Computational | 💡 Idea | 🔬 Research |
|---|---|---|
| Efficient architectures | Simplicity | Ease of access & use |
| Efficient code | Composability | Ease of modification |
| Efficient hardware | Cognitive complexity | Ease of evaluation |

The best efficiency gains come from better ideas, not just better hardware. A breakthrough in idea efficiency (FlashAttention, MoE routing, the Transformer itself) yields orders-of-magnitude computational gains downstream.

The Four Weapons of Model Compression

| Tool | What It Does | Typical Result | Analogy |
|---|---|---|---|
| Pruning 🔪 | Remove low-importance connections | 2–10× sparser | Sculptor removing marble to reveal the statue |
| Quantization 📐 | Reduce numerical precision | 2–8× smaller | Approximating π as 3.14 instead of 3.14159265... |
| Distillation 🧪 | Train small model to mimic large | 10–100× smaller | A wise teacher training a bright student |
| Architecture 🏗️ | Build efficiency in (MoE, SSM) | Variable | Purpose-built, not retrofitted |

The power move: A quantized, pruned, distilled model on an efficient architecture can be 100× smaller with <5% quality loss.

The modern pipeline: Train big → Distill small → Quantize → Deploy

Pruning — SparseGPT, Wanda, and 2:4 Sparsity

Model pruning diagram

Core idea: Not all connections matter equally. Find and remove the unimportant ones.

The Lottery Ticket Hypothesis (Frankle & Carbin, 2018, arXiv:1803.03635): Within a large network exists a small subnetwork that performs equally well — but you must train the big one first to find it.

SparseGPT (2023, arXiv:2301.00774): One-shot pruning — no retraining. Prunes to 50-60% sparsity with minimal perplexity increase.

Wanda (Sun et al., ICLR 2024, arXiv:2306.11695): Prune by |weight| × input-activation norm — comparable to SparseGPT, no Hessian needed.

2:4 Semi-Structured Sparsity (NVIDIA): In every 4 weights, exactly 2 must be zero. Hardware-accelerated on Ampere/Hopper — 2× speedup. The only sparsity with real hardware support.
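
The two ideas above combine naturally: score weights Wanda-style, then enforce the 2:4 pattern. This is a toy sketch on a single weight row, assuming per-input-channel activation norms are already measured on calibration data; it is not the reference implementation of either method.

```python
def wanda_scores(W, act_norms):
    """Score each weight as |W[i][j]| * ||X_j||, one activation norm per input column."""
    return [[abs(w) * n for w, n in zip(row, act_norms)] for row in W]

def prune_2_of_4(W, scores):
    """In every group of 4 consecutive weights in a row, zero the 2 lowest-scoring."""
    pruned = []
    for w_row, s_row in zip(W, scores):
        new_row = list(w_row)
        for start in range(0, len(w_row), 4):
            group = range(start, min(start + 4, len(w_row)))
            for j in sorted(group, key=lambda idx: s_row[idx])[:2]:
                new_row[j] = 0.0
        pruned.append(new_row)
    return pruned

W = [[0.9, -0.1, 0.05, -0.8, 0.2, 0.3, -0.4, 0.1]]
act_norms = [1.0, 20.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0]   # channels 1 and 7 are "loud"
pruned = prune_2_of_4(W, wanda_scores(W, act_norms))
print(pruned)
```

In the first group the small weight -0.1 survives and the larger -0.8 is removed, because the input channel feeding -0.1 carries much larger activations. Pure magnitude pruning would have made the opposite choice.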

🆕 RMT-Guided Spectral Pruning (2026, arXiv:2602.22345): Uses Random Matrix Theory to decompose weight matrices into signal vs. noise components. Spectral pruning (removing noise eigenvalues) consistently outperforms magnitude pruning. Reveals predictable layer-wise redundancy patterns — mathematical tools to measure how much of a model actually carries information.

Honest assessment: unstructured sparsity is hard to accelerate. 2:4 structured sparsity is the pragmatic path. But RMT-guided pruning gives us the first principled way to know what to prune — expect this to reshape pruning strategies.

Quantization — The 2025 Landscape

Quantization precision ladder

Core idea: Use fewer bits per weight. Dramatic compression for minimal quality loss.

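| Precision | Bits | Size (70B) | Quality | Sweet Spot? |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | Baseline | Training only |
| FP16/BF16 | 16 | 140 GB | ~Same | Standard training |
| INT8 | 8 | 70 GB | ~99% | Post-training quant |
| INT4 | 4 | 35 GB | ~95-98% | ✅ Inference sweet spot |
| BitNet 1.58 | 1.58 | ~14 GB | ~Same* | Research frontier |

* Reported at the 3B-parameter scale when trained from scratch with QAT.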

PTQ (Post-Training): Quantize after training. Fast. QAT (Quantization-Aware Training): Train with quantization in loop. Better quality, higher cost.
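
The simplest form of PTQ is round-to-nearest with a single scale per tensor. The sketch below uses that scheme (symmetric absmax) to show how the error grows as bits shrink, and recomputes the 70B sizes; production methods such as GPTQ or AWQ are considerably more careful than this.

```python
def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def max_error(weights, bits):
    q, scale = quantize_absmax(weights, bits)
    return max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))

weights = [0.12, -0.5, 0.33, 0.9, -0.07, 0.61]       # toy values
err8, err4 = max_error(weights, 8), max_error(weights, 4)
size_gb = {bits: 70e9 * bits / 8 / 1e9 for bits in (32, 16, 8, 4)}
print(err8, err4, size_gb)
```

Storage halves with every halving of the bit width, while the worst-case rounding error roughly doubles per bit removed, which is why calibration-aware methods matter at 4 bits and below.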

Quantization Methods — The Ecosystem

| Method | Bits | Type | Key Innovation | Best For |
|---|---|---|---|---|
| bitsandbytes (NF4) | 4 | PTQ | NormalFloat + double quant | QLoRA fine-tuning |
| GPTQ | 3-4 | PTQ | Hessian-based layer-wise quant | GPU inference, large ecosystem |
| AWQ | 4 | PTQ | Protects activation-salient channels | Better quality than GPTQ |
| EXL2 | 2-8 mixed | PTQ | Per-layer flexible bit-width | Max quality for size budget |
| GGUF (llama.cpp) | 2-8 | PTQ | CPU-optimised, mixed quant | Local/CPU inference |
| BitNet b1.58 | 1.58 | QAT | Ternary {-1,0,1} from scratch | Research frontier |

AWQ — Why Activation Awareness Matters

AWQ (Lin et al., 2024, arXiv:2306.00978) — one of the strongest and most widely deployed PTQ methods in 2025.

The Core Insight

  • Not all weights are equally important
  • ~1% of weight channels carry disproportionate information
  • These "salient" channels correspond to large activation magnitudes (not large weight magnitudes!)
  • Naive quantization destroys these critical channels → quality collapse

The Solution

  • Identify salient channels by measuring activation magnitudes on a small calibration set
  • Apply per-channel scaling to protect salient channels before quantization
  • Scale factors absorbed into adjacent layers — zero overhead at inference
Why this is elegant: Activation statistics (not weight statistics) determine quantization sensitivity. AWQ checkpoints are supported out of the box in vLLM and most serving frameworks.
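
The scaling trick can be seen on a toy example. The sketch below scales one salient input channel up before 4-bit rounding and folds the inverse scale into the activation; the weights, activations, and the hand-picked scale of 2 are all hypothetical (real AWQ searches for scales on calibration data and quantizes in groups).

```python
def quantize_row(row, bits=4):
    """Round-to-nearest with one absmax scale per row; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax
    return [round(w / scale) * scale for w in row]

def output(row, x):
    return sum(w * v for w, v in zip(row, x))

W = [0.9, 0.2, -0.3, 0.5]          # one output row of a weight matrix
x = [1.0, 10.0, 1.0, 1.0]          # input channel 1 carries a large activation
s = [1.0, 2.0, 1.0, 1.0]           # hypothetical scale applied to the salient channel only

exact = output(W, x)
plain = output(quantize_row(W), x)

W_scaled = [w * si for w, si in zip(W, s)]     # fold s into the weights
x_scaled = [v / si for v, si in zip(x, s)]     # fold 1/s into the previous layer's output
awq = output(quantize_row(W_scaled), x_scaled)

err_plain, err_awq = abs(plain - exact), abs(awq - exact)
print(err_plain, err_awq)
```

Before rounding, the scaled and unscaled versions give identical outputs. After rounding, the salient weight sits on a relatively finer grid, so on this toy input its error shrinks where it matters most.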

BitNet b1.58 — The Ternary Future?

BitNet b1.58 (Ma et al., 2024, arXiv:2402.17764)

Every weight is one of three values: {-1, 0, 1} — that's log₂(3) ≈ 1.58 bits.

How: QAT from scratch. Weights quantized via absmean: w' = RoundClip(w / mean(|w|), −1, 1) ∈ {-1, 0, 1}. Activations quantized to INT8 per-token.
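
A minimal sketch of the absmean step and of what "multiplication becomes addition" means. It quantizes an already-given weight list for illustration only; in BitNet the quantization happens inside training, and the epsilon and toy values here are assumptions.

```python
def absmean_ternary(weights, eps=1e-8):
    """Scale by the mean absolute value, round, and clip to {-1, 0, 1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma

def ternary_dot(ternary, x):
    """Dot product using only additions and subtractions of activations."""
    total = 0.0
    for t, v in zip(ternary, x):
        if t == 1:
            total += v
        elif t == -1:
            total -= v
    return total

weights = [0.8, -0.05, -1.3, 0.4, 0.02, -0.6]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ternary, gamma = absmean_ternary(weights)
y = gamma * ternary_dot(ternary, x)       # one multiply per output, to restore the scale
print(ternary, y)
```

The inner loop contains no multiplications at all, which is the property that makes ternary weights attractive for custom hardware.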

Key Results

  • At 3B params: matches FP16 Transformer perplexity at same model size and training tokens
  • Dramatic improvements in latency, memory, throughput, and energy

The Radical Implication

  • Matrix multiplication becomes integer addition — no floating point hardware needed
  • Opens the door for custom silicon optimised for ternary ops
  • Could enable LLM inference on hardware without an FPU
Think: If BitNet works at scale, we don't need FP hardware for AI. What does that imply for chip design? For putting AI in a wristwatch?

The GGUF Revolution — LLMs for Everyone

GGUF (GPT-Generated Unified Format) — the file format that democratised local AI. Created by the llama.cpp project (Georgi Gerganov, 2023-2024).

  • CPU-first inference — runs on x86, ARM, Apple Silicon, RISC-V
  • Mixed quantization: different bit-widths per tensor type
  • Quality tiers: Q2_K (tiny, lossy) → Q4_K_M (sweet spot) → Q8_0 (near-lossless)

The Ecosystem Built on GGUF

  • Ollama — one-command local LLM (ollama run llama3.2)
  • LM Studio — GUI for model exploration
  • Open WebUI — ChatGPT-like interface, fully local

The Democratisation Story

  • 2023: Running a 7B model required ≥$1,000 GPU
  • 2025: GGUF Q4 runs 7B on a MacBook Air, a Raspberry Pi 5, or a phone
🔑 The real benchmark isn't MMLU — it's "can a grad student run this on their laptop?"

Distillation: How to Teach a Student to Think Like a Master

Distillation paper diagram
Distillation high level idea

"Distilling the Knowledge in a Neural Network" — Hinton, Vinyals, Dean (2015, arXiv:1503.02531)

Core idea: A smaller "student" learns from a larger "teacher's" soft predictions.

Why soft labels beat hard labels:

  • Hard label: "this is a cat" — a single class index, silent about every other class
  • Soft label: "90% cat, 5% dog, 3% tiger..." — rich relational structure
  • The teacher's distribution over wrong answers reveals similarity — this "dark knowledge" is where the magic lives

The distillation loss:

L = α·CE(y, σ(z_s)) + (1-α)·T²·KL(σ(z_t/T) ‖ σ(z_s/T))

Where T = temperature (higher → softer distributions → more dark knowledge exposed)
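
The loss above can be written out directly. This sketch works on raw logits for a single example; the three-class cat/dog/fox logits, alpha = 0.5, and T = 2 are illustrative choices.

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((v - m) / T) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """alpha * CE(y, student) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    ce = -math.log(softmax(student_logits)[label])
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * T * T * kl

teacher = [6.0, 3.0, 1.0]                   # cat, dog, fox (toy logits)
student = [2.0, 1.5, 1.0]
sharp = softmax(teacher, T=1.0)             # nearly one-hot
soft = softmax(teacher, T=4.0)              # wrong classes become visible
loss = distillation_loss(student, teacher, label=0)
print([round(p, 3) for p in sharp], [round(p, 3) for p in soft], loss)
```

Raising T flattens the teacher distribution so the relative ranking of wrong classes contributes to the gradient; the T² factor keeps the soft-target term on the same scale as the hard-label term.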

The Temperature Trick 🌡️

When you train a student directly on hard labels:

  • Cat: 100%, Dog: 0%, Fox: 0% — the student learns nothing about similarity

With T > 1 (temperature scaling on teacher logits):

  • Cat: 70%, Dog: 20%, Fox: 10% — soft targets
  • The student now learns: "It's a cat, but it kind of looks like a dog."
  • This relational information is the hidden gold in the teacher's predictions
Soft targets carry dark knowledge — the teacher's entire learned understanding of similarity structure between classes, compressed into a probability distribution. Hard labels throw all of that away.
Distillation temperature diagram

The Conundrum — Why Does Distillation Work?

Two paths to a 1B student:

  • Path A: Train 1B directly on a giant dataset
  • Path B: Train a 72B teacher first, then distill to 1B

Path B wins — often dramatically.

But why? Nobody fully knows. Antreas's theory:

"Think of the larger model as the larger parallel search for a good solution — more parameters and paths, more likely a path found will be closer to the global best. Once learned, that path can be compactly extracted into a smaller model." — Antreas Antoniou
  • Lottery Ticket: The big teacher finds a better subnetwork. Distillation extracts it.
  • Compressed curriculum: Difficulty-calibrated supervision — confident distributions for easy, uncertain for hard.
Lottery ticket hypothesis
🆕 2026 Validation — Invariant Algorithmic Cores (arXiv:2602.22600): Different random seeds → functionally identical computational structures. The optimal algorithm is uniquely determined by the data. Gradient descent finds the minimum free-energy solution — and that solution is unique. This directly validates distillation: if the core algorithm is invariant, distillation preserves it perfectly.

Distillation Types — 2025 Taxonomy

Classical:

| Type | What's Transferred | Reference |
|---|---|---|
| Knowledge Distillation (KD) | Soft labels (output logits) | Hinton et al. (2015) |
| Progressive Distillation | Chain: teacher → medium → small | Furlanello et al. (2018) |
| Layer-wise Distillation | Hidden representations per layer | Romero et al. (2015) FitNets |
| Task-Specific Distillation | Domain fine-tuned knowledge | Jiao et al. (2019) TinyBERT |

Modern (2024-2025):

| Type | What's Transferred | Key Example |
|---|---|---|
| Reasoning Distillation | Chain-of-thought traces | DeepSeek-R1 → R1-Distill-7B |
| Synthetic Data | Teacher-generated training data | GPT-4 → Phi-3, Gemini → Gemma |
| Constitutional AI / RLAIF | AI-generated preference labels | Claude's Constitutional AI |
| Behaviour Cloning | Input-output pairs from teacher API | Orca, Vicuna, Alpaca |

🔑 The boundary between "distillation" and "synthetic data generation" has blurred. When GPT-4 generates training data for Phi-3, that's functionally distillation.

DeepSeek-R1 — Distilling Reasoning Itself

The most important distillation result of this era. A 7B model distilled from a 671B teacher scores 55.5% on AIME 2024, far ahead of much larger non-reasoning models such as GPT-4o, as reported in the same paper. Let that sink in.

The details (DeepSeek-AI, Jan 2025, arXiv:2501.12948)

The Recipe

  • Train DeepSeek-R1 (671B MoE) to reason via chain-of-thought with large-scale RL (GRPO); R1-Zero uses pure RL, R1 adds a small cold-start SFT stage first
  • Generate ~800K reasoning traces across maths, coding, science, logic
  • Fine-tune smaller base models (Qwen-2.5, Llama-3) on these traces — pure SFT, no RL on student

| Model | Params | AIME 2024 | Codeforces | Runs On |
|---|---|---|---|---|
| OpenAI o1-mini | Unknown | 63.6% | 1820 Elo | API only |
| DeepSeek-R1 | 671B (37B active) | 79.8% | 2029 Elo | Multi-GPU server (e.g., 8× H200) |
| R1-Distill-Qwen-32B | 32B | 72.6% | 1691 Elo | 1× H100 |
| R1-Distill-Qwen-7B | 7B | 55.5% | 1189 Elo | Laptop |
| R1-Distill-Qwen-1.5B | 1.5B | 28.9% | | Phone |

🔑 Proves reasoning itself is distillable. A 7B model on a laptop doing competition mathematics. The teacher's CoT is a structured curriculum — the student learns how to think.

Synthetic Data as Distillation — The New Pipeline

The pattern: Teacher generates data → Student trains on it → Student rivals models 10-50× its size

| Student | Params | Teacher | Key Result |
|---|---|---|---|
| Phi-3-mini | 3.8B | GPT-4 + web | Rivals Mixtral-8×7B (46.7B) on reasoning |
| Orca 2 | 7B/13B | GPT-4 | Strong step-by-step reasoning via explanation tuning |
| Gemma 2 | 2B/9B/27B | Gemini | 2B competitive with models 5× larger |
| Llama 3.2 | 1B/3B | Llama 3.1 405B | SotA sub-3B via pruning + distillation |
| Qwen 2.5 | 0.5B–72B | Internal | Competitive across all sizes |

"This is distillation's killer app: not matching logits, but generating wisdom."

The uncomfortable implication: Many of the best small models are distilled from proprietary large models. Exceptions: open-weight teachers such as DeepSeek-R1 (with an open recipe) and Llama 3.1 405B.

PRMs and Constitutional AI

Process Reward Models (PRMs)

  • Instead of judging only the final answer (outcome reward), PRMs score each intermediate reasoning step
  • Dense, step-level supervision → more informative training signal
  • Enables self-improving loops: model generates reasoning → PRM scores steps → model improves
  • Key to making reasoning models (o1, R1) work reliably

Constitutional AI / RLAIF (Anthropic)

  • Replace expensive human preference labelling with AI-generated feedback
  • Define a "constitution" (principles the model should follow)
  • The AI critiques and revises its own outputs against these principles
  • Generates preference pairs automatically → train with DPO/RLHF
  • Scales supervision without scaling human annotation costs
Connection to distillation: Both PRMs and Constitutional AI are forms of automated knowledge transfer — the system generates its own training signal, philosophically distillation from encoded principles.

Speculative Decoding — Free Speed, Mathematically Exact

The most elegant efficiency technique in modern inference.

The problem: Autoregressive decoding is memory-bandwidth bound. The GPU loads full model weights for each token but does very little computation. Hardware is >90% idle.

The solution (Leviathan et al., 2023, arXiv:2211.17192):

  • Small draft model (e.g., 1B) generates K candidate tokens — fast
  • Large target model (e.g., 70B) scores all K tokens in ONE forward pass
  • Accept all tokens up to first rejection; resample at rejection point
  • Output distribution is EXACTLY the target model's — mathematically proven
Why it works: Verifying K tokens takes ~the same time as generating 1. If draft acceptance rate is ~70-80%: 2-3× wall-clock speedup for free.
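
The exactness claim can be checked without any sampling. The sketch below computes, for one drafted token, the distribution that the accept-or-resample rule produces (accept with probability min(1, p/q), otherwise resample from the normalised positive part of p − q), and the expected number of tokens per target pass under a constant acceptance rate. The toy distributions are made up, and the function names are illustrative.

```python
def speculative_output_distribution(p, q):
    """Distribution of the token emitted by one draft-then-verify step.

    p: target model probabilities, q: draft model probabilities (same vocabulary).
    """
    accept = [min(pi, qi) for pi, qi in zip(p, q)]        # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:                                       # p == q: the draft is never rejected
        return accept
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]

def expected_tokens_per_target_pass(acceptance_rate, k):
    """Expected tokens emitted per target forward pass with k drafted tokens (rate < 1)."""
    a = acceptance_rate
    return (1 - a ** (k + 1)) / (1 - a)

p = [0.6, 0.3, 0.1]      # target model
q = [0.4, 0.5, 0.1]      # draft model
out = speculative_output_distribution(p, q)
print(out, expected_tokens_per_target_pass(0.8, k=4))
```

The emitted distribution equals p regardless of how poor the draft is; a worse draft only lowers the acceptance rate. Tokens per target pass is an upper bound on the speedup, since the draft model's own forward passes also cost time.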

Let that land: This isn't hardware. This is a mathematical trick. And it cuts decoding latency by roughly 2–3× with no change to the output distribution.

The distillation connection: A better draft model → higher acceptance rate → more speedup. Acceptance rate is a quantitative measure of distillation quality.


Reportedly used in production: Google (Gemini), Anthropic (Claude), most major providers.

The inference stack:

  • FlashAttention v2/v3 — tiled SRAM attention, 2-4× speedup
  • PagedAttention (vLLM) — virtual memory for KV cache, 2-4× more concurrent users
  • Continuous batching — dynamic add/remove requests mid-generation
2025 minimum serving stack: vLLM + FlashAttention + PagedAttention + continuous batching + speculative decoding. This is not optional.

FlashAttention & PagedAttention — The Inference Stack

FlashAttention (recap for serving context)

  • v1 (Dao et al., 2022): 2-4× speedup, O(N) memory — tile computation into GPU SRAM
  • v2 (Dao, 2023, arXiv:2307.08691): 50-73% theoretical max FLOPS on A100
  • v3 (Shah et al., 2024): H100 features — FP8, TMA, warp specialisation
  • Now default everywhere. If you're not using FlashAttention in 2025, you're wasting money.

PagedAttention / vLLM (Kwon et al., 2023, arXiv:2309.06180)

  • Problem: KV cache is variable-length and wasteful — pre-allocating for max length wastes ~60-80% of GPU memory
  • Solution: Manage KV cache like virtual memory pages — allocated on demand, freed when done
  • Result: Near-zero memory waste → 2-4× more concurrent requests per GPU

Continuous Batching: Dynamically add/remove requests mid-generation. No waiting for the longest sequence.

The 2025 minimum viable serving stack: vLLM/TGI + FlashAttention + PagedAttention + continuous batching + (optionally) speculative decoding.

Algorithmic Efficiency — Beyond Hardware

The biggest efficiency gains come from algorithmic breakthroughs, not faster chips. Here are the key directions:

Attention Algorithms

  • FlashAttention v1/v2/v3 — Tiled SRAM computation. Exact attention, 2-4× speedup, O(N) memory. Now the default everywhere.
  • Linear Attention — Approximate softmax with kernel trick. O(N) time. Trade-off: quality loss for long sequences.
  • Sparse Attention — BigBird, Longformer: attend to local windows + global tokens. O(N√N) or O(N).
  • Local/Dynamic Attention — Sliding window (Mistral), dilated patterns. Adaptive context depending on content.

Low-Level Optimisation

  • Custom CUDA kernels — Hand-optimised fused kernels for critical operations (attention, RMSNorm, rotary embeddings)
  • Operation fusion — Combine multiple sequential operations into a single GPU kernel pass. Eliminates memory round-trips. Examples: fused attention+softmax, fused layernorm+residual
  • torch.compile / Triton — JIT compilation of PyTorch graphs to optimised GPU code
🔑 A single algorithmic insight (FlashAttention) delivered more speedup than two hardware generations. The lesson: invest in algorithms, not just silicon.

Distillation — How Far Can We Go?

🍎 Low Hanging Fruit (Available Now)

  • Reasoning distillation via CoT traces (DeepSeek-R1 recipe)
  • Synthetic data pipelines from frontier models (Phi-3, Orca, Gemma)
  • Progressive distillation for multi-scale model variants (e.g., 70B → 32B → 7B → 1.5B)
  • Better distillation for local single-GPU deployment — quantization + distillation composition
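
As a reference point for the items above, here is a minimal sketch of the classic logit-matching distillation loss from Hinton et al. (2015): KL divergence between temperature-softened teacher and student distributions, scaled by T². Note the assumption: the CoT-trace recipes listed above typically fine-tune the student on teacher-generated text rather than on logits, so this shows the underlying idea, not those pipelines. The logit values are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # soft targets
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.5, 0.2]   # hypothetical teacher logits for one token
student = [2.5, 2.0, 0.1]   # hypothetical student logits for the same token
print("loss:", distillation_loss(student, teacher))
```

A higher temperature flattens the teacher distribution, so the student also learns from the relative probabilities of the wrong answers.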

🍊 Medium Hanging Fruit (Active Research)

  • Reducing distillation compute cost itself — distillation is expensive; can we make it cheaper?
  • Multi-teacher → single student (ensemble of diverse teachers → one versatile student)
  • Distillation at same scale for generalization improvement (teacher and student same size)
  • Synthetic data as distillation — the boundary between data generation and distillation has blurred
  • Cross-architecture distillation (Transformer teacher → SSM student)

🔮 High Hanging Fruit (Blue Skies)

  • Lottery Ticket + distillation — detect redundant connections at training time, not after
  • Distill depth-by-depth as layers become "ready" — progressive layer-wise extraction
  • Information-theoretic limits: What is the minimum model size to capture a given capability?
  • 🆕 Invariant Algorithmic Cores (arXiv:2602.22600): If cores are invariant across random seeds, then the true information content is orders of magnitude smaller than parameter count — perfect distillation may be theoretically achievable
  • Self-distillation: A model improving by learning from its own best outputs (connects to RL)
A 7B model doing 85% of a 671B. The capability floor is rising every month. Where does it stop?
Section 3

Smarter, Not Bigger

Local Agents & Intelligence Per Watt

The Scaling Monoculture — The Problem

Since 2020, the dominant recipe: More parameters → better. More data → better. More compute → better.

This works. And that is precisely the problem.

  • Funding flows to scale, not novelty
  • Researchers optimise for GPT-N+1 instead of questioning the architecture
  • Hardware roadmaps ossify around dense matrix multiply
  • Academic labs cannot compete → the field becomes industrially captured

What the scaling laws actually say: They describe one architecture family on one data type. They show that this recipe is predictable, not that it is optimal.

Signs of Diminishing Returns (2024-2025)

  • GPT-4 → GPT-4o: modest improvements despite enormous investment
  • DeepSeek-V3 achieved GPT-4-class with ~$6M training by being smarter, not bigger
  • Biggest gains from better data, better algorithms — not bigger models
The marathon analogy: Runner A consumes 10,000 cal/day but trains sloppily. Runner B consumes 2,000 cal/day but trains perfectly. Runner B wins — not more resources, but intelligently used resources.

Intelligence Per Watt — The Right Metric

Current ML metrics (MMLU, Arena Elo) all measure capability-at-any-cost. What if we measured differently?

System | Power | Capability | Intelligence/Watt
🧠 Human brain | ~20W | General intelligence, lifelong learning | Absurdly high
🖥️ GPT-4 inference (per query) | ~3-10 kW·s | Broad knowledge, strong reasoning | Low
💻 Llama 3.2-3B on phone | ~2-5W | Useful for many tasks | Much higher
📱 Specialised edge model | <1W | Expert at one domain | Highest per niche

  • 2 billion smartphones can run 3B models today — more aggregate compute than any cloud
  • Data centre energy is a geopolitical and environmental issue — Microsoft buying nuclear reactors for AI
  • Many applications need AI where there's no cloud: rural clinics, submarines, developing countries
The brain's trick: Compresses experience into reusable representations, maintains persistent state, builds hierarchical models — all on 20W, 3 bananas/day.
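
To compare a per-query energy figure (kW·s) with a device power figure (W), convert both to joules per query. The sketch below does that arithmetic. The cloud range comes from the table above; the local workload (200 generated tokens at 50 ms/token on a 5 W phone) is a hypothetical example, not a measurement.

```python
def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy = power × time (1 W·s = 1 J)."""
    return power_watts * seconds


# Table figure: 3-10 kW·s per cloud query is 3,000-10,000 J.
cloud_low, cloud_high = 3_000.0, 10_000.0

# Hypothetical local workload (assumed numbers, for illustration only).
tokens, seconds_per_token, phone_watts = 200, 0.05, 5.0
local = energy_joules(phone_watts, tokens * seconds_per_token)  # 5 W × 10 s

# For scale: the brain at 20 W over the same 10 seconds.
brain = energy_joules(20.0, tokens * seconds_per_token)

print(f"local query: {local:.0f} J ({local / 3600:.4f} Wh)")
print(f"cloud / local: {cloud_low / local:.0f}x to {cloud_high / local:.0f}x")
print(f"brain over the same 10 s: {brain:.0f} J")
```

Under these assumptions the cloud query costs roughly two orders of magnitude more energy, though the two systems are not delivering the same capability, which is exactly why the metric has to be capability per joule rather than either number alone.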

What Small Models Can Do NOW (2025)

The floor of small model capability is rising faster than the ceiling of large models:

Model | Params | Teacher | Remarkable Achievement
Phi-3-mini | 3.8B | GPT-4 (synthetic) | Rivals Mixtral-8×7B (46.7B total) on reasoning. Runs on a phone.
Gemma 2 | 2B / 9B | Gemini | 2B competitive with models 5× larger
Llama 3.2 | 1B / 3B | Llama 3.1 405B | Strong instruction following, tool use, multilingual. On-device.
R1-Distill-Qwen-7B | 7B | DeepSeek-R1 (671B) | 55.5% AIME 2024 — competition maths. On a laptop.
Qwen 2.5-Coder-1.5B | 1.5B | Larger Qwen | Competitive with CodeLlama-7B. Runs on a Raspberry Pi.

🔑 The frontier model's purpose is shifting. It's not the product. It's the teacher. Its job is to discover intelligence; distillation's job is to deliver it to every device on earth.

Local Agent Architecture

A personal AI agent running entirely on your hardware:

Component | Role | Example
Reasoning Core | Quantized LLM (3-7B, GGUF Q4) | Llama 3.2 3B, Phi-3-mini
Knowledge Retrieval | Embedding model + vector store | nomic-embed + ChromaDB/SQLite-VSS
Tool Use | File system, browser, APIs, code | Function calling
Memory | Persistent history, user preferences | Local SQLite
Orchestration | Planning, context management | Custom / LangChain

Runtime: llama.cpp / MLX / ExecuTorch  ·  Power: 5-15W  ·  Latency: <100ms/token  ·  Cost: $0/query  ·  Privacy: 100% local
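
A rough sketch of how the retrieval, memory, and orchestration pieces fit together before the prompt reaches the reasoning core. Everything here is a stand-in: the bag-of-words `embed` replaces a real embedding model, a plain list replaces the vector store and the SQLite history, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. A real agent would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, documents: list, history: list) -> str:
    """Orchestration step: pack retrieved context and recent memory into one prompt."""
    context = "\n".join(retrieve(query, documents, k=2))
    memory = "\n".join(history[-3:])
    return f"Context:\n{context}\n\nRecent history:\n{memory}\n\nUser: {query}"


docs = [
    "The brain runs on roughly 20 watts.",
    "Three bananas provide about 315 kcal of energy.",
    "vLLM serves models with PagedAttention.",
]
history = ["User asked about GPU power draw yesterday."]
print(build_prompt("how much energy do three bananas provide", docs, history))
```

The resulting string is what would be handed to the local quantized model; swapping the toy pieces for real components changes the quality of retrieval, not the shape of the loop.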

This is not "ChatGPT on your laptop." This is an AI that knows your documents, learns your patterns, and never phones home.

Five Reasons Your Laptop Beats the Cloud

🔒 Privacy — Data never leaves your device. Period. Medical records, financials, personal conversations — never sent to a third party. GDPR compliance is trivial.

Latency — No network round trip. 10ms local vs 500ms+ cloud. Critical for interactive use (coding assistants, autocomplete), robotics, real-time control.

💰 Cost — Zero marginal cost per query. A $500 device running 24/7 costs less than moderate API bills. Scales to billions of users without billions in infrastructure.

🌐 Offline — Planes, rural clinics, submarines, developing countries, bad WiFi. Resilient to cloud outages, API deprecation, provider pricing changes. Sovereign — no company can revoke your access.

🎯 Personalisation — Fine-tune on your data. Adapt to your style. Learn your domain. Compound returns: your agent improves the longer you use it.

"The best AI is the one that's always there, always private, and always yours."

The Path Forward — Research Agenda

① Better Learning Signals

  • Process reward models → dense, step-level supervision
  • Self-play and verification loops → models that improve by checking their own work
  • Curriculum learning → teach efficiently, like a tutor, not a fire hose

② Architecture Innovation

  • SSMs (Mamba) → linear-time sequence modelling, natural for streaming
  • MoE → activate only relevant parameters (DeepSeek-V3: 671B total, 37B active)
  • Neuromorphic / event-driven computation → process only when something changes
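
To illustrate "activate only relevant parameters", here is a minimal sketch of generic softmax top-k routing. It is not DeepSeek-V3's exact router, which has its own gating and load-balancing details. The experts are toy scalar functions and the router logits are made up.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy experts that simply scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_layer(10.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print("output:", y)

# Active fraction for the figures quoted above: 37B of 671B parameters per token.
print(f"active fraction: {37 / 671:.1%}")
```

Per-token compute scales with the active parameters, while total capacity scales with all of them.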

③ Distillation as the Bridge

  • Frontier models are teachers, not products — their purpose is to bootstrap smaller models
  • DeepSeek-R1 proved reasoning is distillable; what about planning? Creativity? Common sense?

④ Efficiency Through Fundamental Insight

  • BitNet b1.58: we may not need floating point — just {-1, 0, 1}
  • Speculative decoding: 3× free speedup with the right small model
  • Sparsity: the brain is >99% sparse at any moment
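
The {-1, 0, 1} idea can be sketched in a few lines. The code below applies absmean scaling followed by round-and-clip, which is the weight quantization function described in the BitNet b1.58 paper. Caveat: BitNet trains with this quantizer in the loop; applying it to an already trained model, as this toy does with made-up weights, is only an illustration of the mapping.

```python
def ternary_quantize(weights, eps=1e-8):
    """Scale by the mean absolute weight, then round and clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.4, -1.3, 0.02, 0.6]   # hypothetical weight row
q, scale = ternary_quantize(weights)
print("ternary:", q)
print("scale:", round(scale, 3))
print("reconstructed:", [round(w, 3) for w in dequantize(q, scale)])
```

With ternary weights a matrix multiply reduces to additions and subtractions, which is the source of the hardware co-design question raised later in this lecture.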

The Open Frontiers: What We Don't Know Yet

Fundamental 🧠

  • What are the information-theoretic limits of distillation? Minimum model size for a given capability?
  • Which capabilities compress cheaply (factual recall) vs expensively (reasoning, planning)?
  • Are there phase transitions in model size below which capabilities vanish?
  • 🆕 Invariant Algorithmic Cores (arXiv:2602.22600): If trained models converge to the same functional core regardless of seed, what does this mean for the true complexity of intelligence?
  • 🆕 Koopman Spectral Profiling (arXiv:2602.22988): Can we treat the forward pass as a dynamical system and predict (or prevent) training instabilities before they happen?

Architectural 🏗️

  • Can SSMs match Transformers at 1/10th the compute for general language?
  • Modular models: plug in domain-specific modules without retraining?
  • Efficient continual learning without catastrophic forgetting?
  • 🆕 RMT-guided architecture (arXiv:2602.22345): Can we use Random Matrix Theory to design architectures that are inherently less redundant — building lean models from the start?

Systems 📱

  • On-device fine-tuning with limited memory?
  • Federated learning for local agents to improve collectively without sharing private data?
  • Hardware co-design for ternary/sparse computation? (BitNet implies radically different chips)

Evaluation 📊

  • Benchmarks for efficiency: not "score at any cost" but "score at 5W / at 1B params / at $0 API"
  • How to evaluate personalisation quality?
These are not incremental questions. They define a research programme for a generation.

Call to Action — Download Ollama Tonight

After this lecture, do one thing. Not tomorrow. Tonight. Before you go to sleep.

# Install Ollama (macOS/Linux — one command)
curl -fsSL https://ollama.ai/install.sh | sh

# Run a 7B reasoning model on your laptop
ollama run deepseek-r1:7b

# Ask it to solve a maths problem. Watch it think.

Then reflect:

  • This model was distilled from a 671B-parameter system
  • It's running on your hardware, with no internet, no API key, no cost per query
  • Two years ago, running a state-of-the-art reasoning model locally was impossible. Today it's one command
  • The gap between frontier and local is closing faster than anyone predicted
The best time to start building local AI was two years ago. The second best time is right now, tonight, before you close this laptop.
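
Once the model is running, you can also call it from code. The sketch below uses only the Python standard library and assumes Ollama is serving on its default local port (11434) with the `deepseek-r1:7b` model already pulled. The helper names `build_request` and `ask` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default local port


def build_request(prompt: str, model: str = "deepseek-r1:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "deepseek-r1:7b", timeout: float = 300.0) -> str:
    """Send one prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(ask("What is 17 * 23? Show your reasoning."))` should return the model's answer. Nothing leaves your machine: the request goes to localhost.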

Key Tools & Frameworks

You now have the map. Here are the tools.

Training & Fine-Tuning

  • axolotl — github.com/axolotl-ai-cloud/axolotl — minimal but thorough LLM training codebase, fine-tuning & distillation
  • Unsloth — github.com/unslothai/unsloth — 2× faster, 60% less memory
  • HuggingFace TRL — RLHF, DPO, distillation pipelines
  • vLLM — github.com/vllm-project/vllm — production LLM serving

Local AI Tools

  • Ollama — ollama.ai — ollama run llama3.2
  • LM Studio — lmstudio.ai — GUI for model exploration
  • llama.cpp — github.com/ggml-org/llama.cpp — local inference, GGUF
  • Open WebUI — github.com/open-webui/open-webui — ChatGPT-like local UI

Frontier Model Demos

  • ChatGPT (chatgpt.com) · Claude (claude.ai) · Gemini (gemini.google.com)
  • Groq (groq.com) — new hardware architecture for LLMs, custom LPU, extremely fast inference
  • DeepSeek (chat.deepseek.com) — open-weight reasoning models
  • Perplexity (perplexity.ai) — search + LLM summary, excellent for research

HuggingFace

huggingface.co — the arXiv of models. Explore Spaces for demos, Models for weights, Datasets for training.

Key Papers — Your Reading List

Architecture & Training

  • "Attention Is All You Need" — Vaswani et al. (2017) — arXiv:1706.03762
  • "Training Compute-Optimal LLMs" (Chinchilla) — Hoffmann et al. (2022) — arXiv:2203.15556
  • "RoFormer: Rotary Position Embedding" — Su et al. (2021) — arXiv:2104.09864
  • "GLU Variants Improve Transformer" — Shazeer (2020) — arXiv:2002.05202

Alignment

  • "Direct Preference Optimization" — Rafailov et al. (2023) — arXiv:2305.18290
  • "Training LMs to Follow Instructions" (InstructGPT) — Ouyang et al. (2022) — arXiv:2203.02155

Efficiency & Compression

  • "FlashAttention-2" — Dao (2023) — arXiv:2307.08691
  • "GPTQ" — Frantar et al. (2022) — arXiv:2210.17323
  • "AWQ" — Lin et al. (2024) — arXiv:2306.00978
  • "BitNet b1.58" — Ma et al. (2024) — arXiv:2402.17764
  • "SparseGPT" — Frantar & Alistarh (2023) — arXiv:2301.00774

Distillation & Small Models

  • "Distilling the Knowledge in a Neural Network" — Hinton et al. (2015) — arXiv:1503.02531
  • "The Lottery Ticket Hypothesis" — Frankle & Carbin (2018) — arXiv:1803.03635
  • "DeepSeek-R1" — DeepSeek-AI (2025) — arXiv:2501.12948
  • "Phi-3 Technical Report" — Microsoft (2024) — arXiv:2404.14219
  • "Mamba" — Gu & Dao (2023) — arXiv:2312.00752

2026 Frontier Research

  • "Invariant Algorithmic Cores" — (2026) — arXiv:2602.22600 — models discover unique algorithmic solutions
  • "Residual Koopman Spectral Profiling" — (2026) — arXiv:2602.22988 — predicting training instability
  • "Structure & Redundancy via RMT" — (2026) — arXiv:2602.22345 — spectral pruning via Random Matrix Theory

Big Picture: High Level Stages

The high-level pipeline connecting everything we've covered:

  • From data collection and tokenisation through pretraining, fine-tuning, and alignment
  • To compression and distillation for efficient deployment
  • Finally to local / edge deployment and personalisation loops
Each stage offers its own efficiency levers — the techniques we've explored apply at different points in this pipeline.

The colour-coding represents degree of relatedness between stages — not the section colours used in this deck.
[Figure: High-level stages diagram]

Big Picture: Underlying Technique Components

The technique components that power each stage of the pipeline:

  • Each box = a family of techniques (attention, normalisation, training objectives, compression…)
  • Colours indicate relatedness between technique clusters
  • Searching any of these names + "LLM" surfaces the key papers
This overview is your map of the field. You now know enough to navigate it independently.

Tip: Replace "other" with "LLM" in any search query — this surfaces the LLM-specific implementations and benchmarks.
[Figure: Technique components diagram]

The Story We Told Today

Section 1 — LLM 101: The Transformer: LEGO bricks stacked into a sandwich, trained by predicting the next word trillions of times. The 2025 recipe: RMSNorm + SwiGLU + RoPE + GQA. Deceptively simple. Wildly powerful — and fundamentally a meta-learning system (Antreas's hot take!). Plus: how to train the beast at scale (FSDP, Koopman spectral profiling) and fine-tune it efficiently (LoRA, QLoRA, DPO).

Section 2 — Compression & Distillation: The scaling monoculture is unsustainable. But we have tools: pruning (carve the statue — now with RMT-guided spectral methods), quantization (approximate π), distillation (teach the student). Combined: 100×+ compression. DeepSeek-R1 proved even reasoning is distillable. And 2026 research on Invariant Algorithmic Cores suggests perfect distillation may be theoretically achievable.

Section 3 — Smarter, Not Bigger: Small models trained with distillation now rival giants from 12 months ago. The future is local, private, efficient. Intelligence per watt, not intelligence per dollar.

Section 4 — Resources: Tools, papers, and the big picture diagrams to navigate the field.

The arc of this lecture:

  • Beginning: We started with a gap — 20W vs 10,000W. A brain running on fruit vs a cluster running on a power station.
  • Middle: We learned the tools to close that gap — pruning, quantization, distillation, speculative decoding, architectural innovation.
  • Resolution: The gap is closing. You're watching it happen in real time. A 7B model on your laptop today does what a data centre couldn't do two years ago.
The brain runs on 3 bananas a day. A single AI training node runs on ~2,000. Closing that gap is the most important research challenge of our generation.

Thank You & Contact

Antreas Antoniou

📧 antreas@axiotic.ai  ·  iam@antreas.io

🐦 @AntreasAntoniou  ·  🏢 axiotic.ai

Questions? Catch me after the lecture — or email me. Distillation theory, local agent systems, meta-learning connections, the Transformer's true nature, AI research careers — I'm here for all of it. We can have a very long coffee conversation.

Want to see what the community thinks about efficiency? Karpathy's thread: x.com/karpathy/status/1864023344435380613

"The future of AI is not bigger models in bigger data centres. It's smarter models on every device."

Remember: Download Ollama tonight. The revolution fits on your laptop.