---
date: '2025-10-02'
description: by Bharat Venkitesh, and transformers part 2
id: '5'
modified: 2026-06-05 15:08:26 GMT-04:00
seealso:
  - '[[thoughts/Transformers|Transformers]]'
  - '[[thoughts/LLMs|LLMs]]'
  - '[[thoughts/vllm|vLLM]]'
tags:
  - ml
  - tsfm
title: lecture five
created: '2025-10-02'
published: '2025-10-02'
pageLayout: default
slug: thoughts/tsfm/5
permalink: https://aarnphm.xyz/thoughts/tsfm/5.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---
## [[thoughts/Scaling laws|ontological]]

- [Learning Curves: Asymptotic Values and Rate of Convergence](https://proceedings.neurips.cc/paper/1993/hash/1aa48fc4880bb0c9b8a3bf979d3b917e-Abstract.html) \[@cortes1993learningcurves\]
  - Bell Labs
  - $$
    \epsilon_{\text{test}} = a + \frac{b}{l^{\alpha}}, \quad \epsilon_{\text{train}} = a - \frac{c}{l^{\beta}}
    $$
- [ImageNet Classification with Deep Convolutional Neural Networks](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) \[@krizhevsky2012imagenet\]
  - Deeper successors: VGG, residual networks (ResNet)
- [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361) \[@kaplan2020scalinglawsneurallanguage\]
  - OpenAI’s empirical study on power-law relationships in language model performance
  - Key findings:
    - Loss scales as power-law with model size $N$, dataset size $D$, and compute $C$
    - Performance depends strongly on scale, weakly on model shape
    - $$
      L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
      $$
    - Smooth, predictable improvements enable accurate extrapolation
- [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556) \[@hoffmann2022trainingcomputeoptimallargelanguage\]
  - DeepMind’s Chinchilla paper
  - Revised Kaplan et al.’s findings: model size and training data should scale equally
  - For compute-optimal training: $N_{opt} \propto C^{0.5}$, $D_{opt} \propto C^{0.5}$
  - Chinchilla (70B params, 1.4T tokens) outperforms Gopher (280B params, 300B tokens)
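
As a rough worked example of the compute-optimal rule above, the sketch below assumes the common approximation $C \approx 6ND$ training FLOPs and a fixed ratio of about 20 tokens per parameter (the ratio implied by Chinchilla's 70B parameters and 1.4T tokens). Both are assumptions made for illustration, and the function name `compute_optimal_allocation` is hypothetical.

```python
import math


def compute_optimal_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget between parameters N and training tokens D.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)). Both N and D then grow as C**0.5.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget implied by Chinchilla's own configuration under C ~= 6ND.
chinchilla_flops = 6 * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal_allocation(chinchilla_flops)
print(f"N ~ {n_opt:.3g} params, D ~ {d_opt:.3g} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count.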

![[thoughts/Scaling laws#power law]]

### characteristics

- Self-similarity: power laws exhibit scale invariance (no characteristic scale)
- Heavy tails: extreme values more likely than in exponential distributions
- Log-linearity: $\log y = \log a + k \log x$ appears linear in log-log plots
- Predictability: smooth extrapolation enables forecasting from limited data
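
The log-linearity property suggests a simple way to fit and extrapolate a power law: run a linear regression on $(\log x, \log y)$. The sketch below does this on synthetic, noise-free data; the data values and the helper name `fit_power_law` are made up for illustration.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**k by least squares on (log x, log y)."""
    # polyfit returns coefficients highest degree first: [slope, intercept].
    k, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(log_a)), float(k)


# Synthetic data generated from y = 3 * x**-0.5 (illustrative values only).
x = np.array([1e3, 1e4, 1e5, 1e6])
y = 3.0 * x ** -0.5

a, k = fit_power_law(x, y)

# Extrapolate one decade beyond the fitted range.
y_pred = a * 1e7 ** k
print(f"a ~ {a:.3f}, k ~ {k:.3f}, predicted y(1e7) ~ {y_pred:.3e}")
```

With real loss curves the points are noisy and often include an irreducible offset, so a plain log-log fit like this is only a first approximation.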

### implications for training

- Can predict compute requirements for target performance
- Helps determine optimal allocation between model size, data, and training time
- Enables cost-benefit analysis: when to stop scaling vs. architectural improvements
- Informs decisions about data collection vs. algorithmic innovation

## design choices

> \[!question\] Question
>
> How does data size affect performance?

> \[!question\] Question
>
> How should data and parameters be scaled together?

_answer_: stay on the Pareto frontier of loss versus compute, i.e. choose the model size and token count that are compute-optimal for the budget

### depth/width scaling

The aspect ratio compares width ($d_{\text{model}}$) with depth (number of layers).

### architectures

GLU and Switch Transformers

## joint scaling

[A Constructive Prediction of the Generalization Error Across Scales](https://arxiv.org/abs/1909.12673) \[@rosenfeld2019constructivepredictiongeneralizationerror\]

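cosine learning rate schedule:

- warmup phase: the learning rate ramps up (typically linearly) to its peak
- decay phase: the learning rate follows a half cosine from the peak down to its minimum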
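
A minimal sketch of a cosine schedule with warmup, assuming a linear warmup and a single half-cosine decay. The function name `cosine_lr` and its parameters are illustrative and not taken from any particular library.

```python
import math


def cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step` (0-indexed) for warmup plus cosine decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly so the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Decay phase: progress runs from 0 to 1 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


schedule = [cosine_lr(s, max_lr=6e-3, warmup_steps=10, total_steps=100) for s in range(100)]
print(max(schedule), schedule[-1])
```

The peak of this schedule is what the μP table below calls the base learning rate $\eta_{\text{base}}$.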

## how do we choose the right hyperparameters?

- count non-embedding parameters
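
For a sense of scale, the sketch below counts non-embedding parameters for a standard decoder block. It assumes attention projections of size $d_{\text{model}} \times d_{\text{model}}$, a feed-forward width of $4 d_{\text{model}}$ by default, and ignores biases and layer norms, which gives the familiar approximation $N \approx 12\, n_{\text{layers}}\, d_{\text{model}}^2$. The function name is hypothetical.

```python
from typing import Optional


def non_embedding_params(n_layers: int, d_model: int, d_ff: Optional[int] = None) -> int:
    """Approximate non-embedding parameter count of a transformer.

    Ignores biases, layer norms, and the token/position embeddings.
    """
    if d_ff is None:
        d_ff = 4 * d_model  # assumed default feed-forward width
    attn = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff      # up- and down-projection
    return n_layers * (attn + ffn)


# Example: a 12-layer model with d_model = 768.
print(non_embedding_params(12, 768))
```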

## $\mu P$ cheatsheet

see also:

- [Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer](https://arxiv.org/abs/2203.03466) \[@yang2022tensorprogramsvtuning\]
- [Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster](https://arxiv.org/abs/2304.03208) \[@dey2023cerebrasgptopencomputeoptimallanguage\]

$$
\begin{array}{|l|c|c|}
\multicolumn{1}{l}{} & ~~\text{Standard Parameterization (SP)}~~ & \multicolumn{1}{c}{~~~~~~~~~\text{Maximal Update (} \mu \text{P)}~~~~~~~~~} \\
\multicolumn{3}{l}{\textbf{Variables}} & & \\
\hline \hline
W & \multicolumn{2}{c|}{\text{A multiplicative or fully-connected weights tensor}} & \\
\hline
b & \multicolumn{2}{c|}{\text{A bias weights tensor}} & \\
\hline
X, Y & \multicolumn{2}{c|}{\text{Activation tensors: layer input, output, respectively}} & \\
\hline
d_{\text{model,base}} & \multicolumn{2}{c|}{\text{Proxy (base) model\unicode{x2019}s layer width}} & \\
\hline
d_{\text{model}} & \multicolumn{2}{c|}{\text{Width of each layer}} & \\
\hline
d_{\text{head}} & \multicolumn{2}{c|}{\text{Size of each attention head}} & \\
\hline
\text{embed} & \multicolumn{2}{c|}{\text{Combined token} + \text{position embedding function}} & \\
\hline
\eta_{\text{base}} & \multicolumn{2}{c|}{\text{The base learning rate (LR): Maximum in training schedule}} & \\
\hline
\sigma_{\text{base}} & \multicolumn{2}{c|}{\text{The base initialization standard deviation}} & \\
\hline
m_{\text{width}} & \text{---} & \text{Layer width multiplier:} \\
& & d_{\text{model}}/d_{\text{model,base}} \\
\hline
m_{\text{emb}} & \text{---} & \text{Embedding output multiplier} \\
\hline\hline
\multicolumn{3}{l}{\textbf{Empirically Tuned Values}} & & \\
\hline \hline
d_{\text{model,base}} & \text{---} & 256 \\
\hline
\eta_{\text{base}} & \text{Must tune for each model size} & 6e\text{-}3 \\
\hline
\sigma_{\text{base}} & 0.02 & 0.08 \\
\hline
m_{\text{emb}} & \text{---} & 10.0 \\
\hline\hline
\multicolumn{3}{l}{\textbf{Formulas}} & & \\
\hline \hline
\text{Embedding initializer} & W_{\text{emb}} \sim N_{\text{trunc}}(0, \sigma_{\text{base}}^2) & W_{\text{emb}} \sim N_{\text{trunc}}(0, \sigma_{\text{base}}^2) \\
\hline
\text{Embedding LR} & \eta_{\text{emb}} = \eta_{\text{base}} & \eta_{\text{emb}} = \eta_{\text{base}} \\
\hline
\text{Embedding output} & Y_{\text{emb}} = \text{embed}(X) & Y_{\text{emb}} = m_{\text{emb}} \cdot \text{embed}(X) \\
\hline
\text{LN initializer} & W_\gamma \sim 1, b_\beta \sim 0 & W_\gamma \sim 1, b_\beta \sim 0 \\
\hline
\text{LN LR} & \eta_{\text{LN}} = \eta_{\text{base}} & \eta_{\text{LN}} = \eta_{\text{base}} \\
\hline
\text{Bias initializer} & b \sim 0 & b \sim 0 \\
\hline
\text{Bias LR} & \eta_b = \eta_{\text{base}} & \eta_b = \eta_{\text{base}} \\
\hline
\text{MHA equation} & \text{softmax}\left(\frac{Q^T K}{\sqrt{d_{\text{head}}}}\right) V & \text{softmax}\left(\frac{Q^T K}{d_{\text{head}}}\right) V \\
\hline
\text{QKV weights initializer} & W_{\text{qkv}} \sim N_{\text{trunc}}(0, \sigma_{\text{base}}^2) & W_{\text{qkv}} \sim N_{\text{trunc}}(0, \sigma_{\text{base}}^2/m_{\text{width}}) \\
\hline
\text{QKV weights LR} & \eta_{\text{qkv}} = \eta_{\text{base}} & \eta_{\text{qkv}} = \eta_{\text{base}}/m_{\text{width}} \\
\hline
\text{O weights initializer} & W_{\text{o}} \sim N_{\text{trunc}}\!\left(0, \frac{\sigma_{\text{base}}^2}{2 \cdot n_{\text{layers}}}\right) & W_{\text{o}} \sim N_{\text{trunc}}\!\left(0, \frac{\sigma_{\text{base}}^2}{2 m_{\text{width}} \cdot n_{\text{layers}}}\right) \\
\hline
\text{O weights LR} & \eta_{\text{o}} = \eta_{\text{base}} & \eta_{\text{o}} = \eta_{\text{base}}/m_{\text{width}} \\
\hline
\text{ffn1 weights initializer} & W_{\text{ffn1}} \sim N_{\text{trunc}}(0, \sigma_{\text{base}}^2) & W_{\text{ffn1}} \sim N_{\text{trunc}}(0, \sigma_{\text{base}}^2/m_{\text{width}}) \\
\hline
\text{ffn1 weights LR} & \eta_{\text{ffn1}} = \eta_{\text{base}} & \eta_{\text{ffn1}} = \eta_{\text{base}}/m_{\text{width}} \\
\hline
\text{ffn2 weights initializer} & W_{\text{ffn2}} \sim N_{\text{trunc}}\!\left(0, \frac{\sigma_{\text{base}}^2}{2 \cdot n_{\text{layers}}}\right) & W_{\text{ffn2}} \sim N_{\text{trunc}}\!\left(0, \frac{\sigma_{\text{base}}^2}{2 m_{\text{width}} \cdot n_{\text{layers}}}\right) \\
\hline
\text{ffn2 weights LR} & \eta_{\text{ffn2}} = \eta_{\text{base}} & \eta_{\text{ffn2}} = \eta_{\text{base}}/m_{\text{width}} \\
\hline
\text{Output logits multiplier} & Y_{\text{logits}} = W_{\text{unemb}}X & Y_{\text{logits}} = W_{\text{unemb}}X/m_{\text{width}} \\
\hline
\end{array}
$$
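
The μP column of the table can be read as a recipe for deriving per-layer hyperparameters from the base (proxy) model. A minimal sketch, using the empirically tuned values from the table as defaults; the function name and dictionary keys are hypothetical. Note that the table states variances, so the code takes square roots to get standard deviations.

```python
import math


def mup_hidden_hparams(d_model, n_layers, d_model_base=256,
                       eta_base=6e-3, sigma_base=0.08):
    """Derive muP learning rates and init stds for the hidden weight matrices.

    Embedding, LayerNorm, and bias parameters keep eta_base per the table.
    """
    m_width = d_model / d_model_base
    return {
        "m_width": m_width,
        # QKV, O, ffn1, ffn2 weights all use eta_base / m_width.
        "hidden_lr": eta_base / m_width,
        # Variance sigma_base^2 / m_width for QKV and ffn1.
        "qkv_ffn1_init_std": sigma_base / math.sqrt(m_width),
        # Variance sigma_base^2 / (2 * m_width * n_layers) for O and ffn2.
        "o_ffn2_init_std": sigma_base / math.sqrt(2 * m_width * n_layers),
        # Output logits are divided by m_width.
        "logit_multiplier": 1.0 / m_width,
    }


hp = mup_hidden_hparams(d_model=1024, n_layers=8)
print(hp)
```

At `d_model == d_model_base` the width multiplier is 1, so the hidden learning rate and the QKV/ffn1 init reduce to the base values; this is what lets hyperparameters tuned on the small proxy transfer to wider models.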

## pre-training strategies

> \[!note\] Data Parallel
>
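> replicates the full model weights on every GPU; each replica processes a different shard of the batch, and gradients are averaged across replicas before the update.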
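
Data parallelism can be illustrated without any GPUs. The NumPy sketch below is a simulation, not a distributed implementation: every "worker" holds the same weights, computes a gradient on its own equal-sized shard of the batch, and the averaged shard gradients match the full-batch gradient. The linear model, shapes, and worker count are made up for illustration.

```python
import numpy as np


def mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)**2) with respect to w."""
    return x.T @ (x @ w - y) / len(y)


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))   # global batch of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=3)        # identical weights on every worker

n_workers = 4
shard_grads = [
    mse_grad(w, xs, ys)       # each worker sees only its own shard
    for xs, ys in zip(np.split(x, n_workers), np.split(y, n_workers))
]

# Averaging plays the role of the all-reduce step across replicas.
avg_grad = np.mean(shard_grads, axis=0)
full_grad = mse_grad(w, x, y)
print(np.allclose(avg_grad, full_grad))
```

The equality relies on equal shard sizes; with uneven shards the average would need to be weighted by shard size.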