---
date: '2024-02-05'
description: reducing neural network memory and compute through low-precision data types like int8 and fp16, using calibration and rounding schemes.
id: quantization
modified: 2026-06-05 15:08:06 GMT-04:00
tags:
  - seed
  - ml
title: Quantization
created: '2024-02-05'
published: '2024-02-05'
pageLayout: default
slug: thoughts/quantization
permalink: https://aarnphm.xyz/thoughts/quantization.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---
See also: [[thoughts/images/htn-openllm.pdf|this talk]] I gave at Hack the North 2023.

> reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types

- `fp16` - [[thoughts/quantization#`fp32` to `fp16`|half precision]]
- `bfloat16`
- `int8`
- `fp8`

> \[!note\] Note
>
> This also applies to post-training quantization, where the methodology is applied after the model has been trained, instead of during load-time.

![[thoughts/images/quantisation-format.webp|from baseten introduction into quantization format]]

## metrics for calibration

the idea is to compare the difference between two probability distributions, before and after scaling to a lower precision, for example from `int16` to `int8`

### [[thoughts/Kullback-Leibler divergence|KL calibration]]
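A minimal sketch of that comparison, assuming both distributions are available as histograms over the same bins. The helper names, the toy histograms, and the `eps` floor for empty bins are illustrative assumptions, not part of any particular calibrator:

```python
import math

def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]

def kl_divergence(p_hist, q_hist, eps=1e-12):
    """KL(P || Q) between two histograms defined over the same bins."""
    p = normalize(p_hist)
    q = normalize(q_hist)
    # Bins with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical histograms: reference activations and two quantized candidates.
reference = [1, 4, 10, 4, 1]
candidate_a = [1, 5, 9, 4, 1]    # mild distortion
candidate_b = [0, 0, 20, 0, 0]   # everything collapsed into a single bin

scores = {
    "a": kl_divergence(reference, candidate_a),
    "b": kl_divergence(reference, candidate_b),
}
best = min(scores, key=scores.get)  # candidate with the least information loss
print(best, scores)
```

A calibrator built on this idea would typically sweep candidate clipping ranges and keep the one with the smallest divergence.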

## `fp32 -> fp16`

> Does my operation support `fp16`?

- CPU does support saving `fp16` weights, but computations are done in `fp32`

> Is my operation _sensitive_ to `fp16`?

For example, `epsilon` in `LayerNormalization` is usually very small ($10^{-12}$), but the smallest positive normal value in `fp16` is $\approx 6 \times 10^{-5}$, so `epsilon` underflows to zero and can cause `NaN` issues.
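The underflow can be checked with the standard library alone, since `struct` supports the IEEE half-precision format through the `"e"` code. This is a small sketch; `to_fp16` is a helper introduced here for illustration:

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest fp16 value (via struct's 'e' format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

smallest_normal = 2.0 ** -14      # about 6.1e-5
smallest_subnormal = 2.0 ** -24   # about 6.0e-8

eps_fp32 = 1e-12
eps_fp16 = to_fp16(eps_fp32)      # far below the subnormal range, becomes 0.0

# LayerNorm divides by sqrt(variance + eps); a constant input has variance 0.
variance = 0.0
denominator = math.sqrt(to_fp16(variance + eps_fp16))

print(eps_fp16, denominator)      # 0.0 0.0
```

With a zero denominator, the normalized output becomes `0 / 0`, which is `NaN` in IEEE arithmetic.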

## `fp32 -> int8`

Consider a float `x` in `[a, b]`. The _affine quantization scheme_ represents it as:

$$
x = S \cdot (x_q - Z)
$$

where:

- $x_q$ is the quantized `int8` associated with `x`
- $S$ and $Z$ are scaling and zero-point parameters
  - $S$ is the scale, positive `float32`
  - $Z$ is the zero-point, or the `int8` value corresponding to value `0` in `fp32`

Thus quantized value $x_q$ is: $x_q = \text{round}(x / S + Z)$

Any `fp32` value outside of `[a, b]` is clipped to the closest representable value, so for an arbitrary input:

$$
\forall x \in \mathbb{R} \quad x_q = \text{clip}(\text{round}(x/S + Z), \text{round}(a/S + Z), \text{round}(b/S + Z))
$$
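A minimal sketch of the scheme in plain Python. The way $S$ and $Z$ are derived from `[a, b]` here, $S = (b - a) / (q_{\max} - q_{\min})$ and $Z = \text{round}(q_{\min} - a / S)$, is one common choice and an assumption of this example. With that choice the clip bounds $\text{round}(a/S + Z)$ and $\text{round}(b/S + Z)$ are `-128` and `127`.

```python
def affine_params(a: float, b: float, qmin: int = -128, qmax: int = 127):
    """Derive scale S and zero-point Z so that [a, b] maps onto [qmin, qmax]."""
    scale = (b - a) / (qmax - qmin)
    zero_point = round(qmin - a / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """x_q = clip(round(x / S + Z), qmin, qmax)."""
    return max(qmin, min(qmax, round(x / scale + zero_point)))

def dequantize(x_q, scale, zero_point):
    """x ~= S * (x_q - Z)."""
    return scale * (x_q - zero_point)

S, Z = affine_params(-1.0, 3.0)
for x in (0.0, 0.5, 3.0, 10.0):   # 10.0 lies outside [a, b] and gets clipped
    x_q = quantize(x, S, Z)
    print(x, x_q, dequantize(x_q, S, Z))
```

Inside the range the round-trip error stays within about half a quantization step ($S/2$), and `0.0` maps exactly onto the zero-point.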

[Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference](https://arxiv.org/abs/1712.05877) \[@jacob2017quantizationtrainingneuralnetworks\]

## `mxfp4`

see also: [specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)

stands for _microscaling (mx) of 4-bit floating-point (fp4)_

Described in [Microscaling Data Formats for Deep Learning](https://arxiv.org/abs/2310.10537) \[@rouhani2023microscalingdataformatsdeep\] and first proposed in the Open Compute Project (OCP), backed by OpenAI, AMD, NVIDIA, Microsoft, and Meta.

Developed for training, given that FP4 is “good enough” in inference.

- E2M1: 1 sign bit, 2 exponent bits, and 1 mantissa bit per parameter.
- Block: elements are grouped into blocks of 32 (`block_size = 32`).
- Each block shares one common 8-bit scale (E8M0, a power of two), chosen to best fit all values in the block.
- The value is decoded as:
  $$
  X_i = P_i \times S
  $$
  where $X_i$ is the reconstructed value, $P_i$ is the FP4 quantized value, and $S$ denotes the shared scale.

To preserve gradient integrity:

- Stochastic Rounding: randomizes the rounding direction so that rounding is unbiased in expectation. This avoids systematic loss of information during training updates and preserves learning progress.
- Random Hadamard Transform
- Group-wise Quantization
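To make the first item concrete, here is a minimal sketch of stochastic rounding onto the integer grid. The function name and the fixed seed are illustrative assumptions. A real FP4 training kernel would round onto the FP4 grid instead, but the unbiasedness argument is the same.

```python
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down or up, rounding up with probability equal to its fractional part."""
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)

rng = random.Random(0)   # fixed seed, only to make the sketch reproducible
x = 0.3
samples = [stochastic_round(x, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)

print(round(x))   # 0: round-to-nearest always discards the 0.3
print(mean)       # close to 0.3 on average
```

Round-to-nearest would erase every update smaller than half a step, while the stochastic version keeps it in expectation.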

**Algorithm 1** Convert vector of scalar floats $\{V_i\}_{i=1}^k$ to an MX block $\{X, \{P_i\}_{i=1}^k\}$

**Require:** $e^{\max}_{\text{elem}}$ (exponent of the largest normal number in the element data format)

1. $\text{shared\_exp} \gets \left\lfloor \log_2\left(\max_i |V_i|\right) \right\rfloor - e^{\max}_{\text{elem}}$
2. $X \gets 2^{\text{shared\_exp}}$
3. **for** $i = 1$ **to** $k$ **do**
   - $P_i \gets \text{quantize\_to\_element\_format}\left(\frac{V_i}{X}\right)$ (clamp to normal-number range)
4. **end for**
5. **return** $X,\ \{P_i\}_{i=1}^{k}$

![[thoughts/images/compute-flow-mxformat.webp]]
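The block conversion can be sketched in plain Python for the E2M1 element format. Several details are assumptions of this example: the E2M1 magnitudes are enumerated explicitly, $e^{\max}_{\text{elem}} = 2$ (the largest value is $6 = 1.5 \times 2^2$), an all-zero block gets a scale of `1.0`, ties go to the smaller magnitude rather than to even, and the shared exponent is not clamped to the 8-bit scale range. The demo block has 4 elements instead of 32 for brevity.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1.
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2  # exponent of the largest normal E2M1 number

def quantize_e2m1(x: float) -> float:
    """Clamp to the E2M1 range, then snap to the nearest grid magnitude."""
    mag = min(abs(x), FP4_E2M1_GRID[-1])
    nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)

def mx_quantize_block(values):
    """Return (shared scale X, quantized elements P_i) for one block."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)   # assumption: any scale works for zeros
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    return scale, [quantize_e2m1(v / scale) for v in values]

def mx_dequantize_block(scale, elements):
    """Reconstruct each value as P_i times the shared scale."""
    return [p * scale for p in elements]

block = [0.1, -0.7, 3.0, 12.0]
scale, elements = mx_quantize_block(block)
print(scale, elements)                        # 2.0 [0.0, -0.5, 1.5, 6.0]
print(mx_dequantize_block(scale, elements))   # [0.0, -1.0, 3.0, 12.0]
```

The largest element fixes the scale, so small values in the same block (here `0.1`) can round all the way to zero. That is the cost of sharing one scale across a block.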

## quantization time

- Post-training **dynamic quantization**: range of each activation is computed on the fly at _runtime_
- Post-training **static quantization**: range of each activation is computed _offline_ before _runtime_
  - Observers are put on activations to collect their values
  - a certain number of forward passes are run on a calibration dataset
  - the range of each activation is computed according to some _calibration technique_
- **Quantization aware training**: range of each activation is computed _during training_
  - `fake_quantize` operations are inserted in the computation graph
  - `fake_quantize` is a no-op during inference, but during training, it simulates the effect of quantization
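The three timings differ mainly in when the activation range is obtained. The sketch below uses a hypothetical min/max observer and a hand-written `fake_quantize`. The names mirror the terms above but are not tied to any framework's API, and min/max is only one possible calibration technique.

```python
import math

class MinMaxObserver:
    """Tracks the running min/max of every value seen during calibration."""

    def __init__(self):
        self.lo = math.inf
        self.hi = -math.inf

    def observe(self, activations):
        self.lo = min(self.lo, min(activations))
        self.hi = max(self.hi, max(activations))

def fake_quantize(x, lo, hi, qmin=-128, qmax=127):
    """Quantize then immediately dequantize: x stays a float but carries int8 error."""
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    x_q = max(qmin, min(qmax, round(x / scale + zero_point)))
    return scale * (x_q - zero_point)

# Static: run calibration batches once, offline, then freeze the range.
observer = MinMaxObserver()
for batch in ([-0.5, 0.2], [1.5, 0.0]):   # stand-in for forward passes on calibration data
    observer.observe(batch)
static_range = (observer.lo, observer.hi)

# Dynamic: recompute the range from each incoming batch at runtime.
runtime_batch = [0.1, 0.4, -0.2]
dynamic_range = (min(runtime_batch), max(runtime_batch))

# QAT-style simulation: the forward pass sees values with quantization error baked in.
simulated = [fake_quantize(x, *static_range) for x in runtime_batch]
print(static_range, dynamic_range, simulated)
```

The dynamic range is tighter for this batch, which gives finer steps at the cost of computing the range on every call.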

## methods

[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) 

### GPTQ

[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) \[@frantar2023gptqaccurateposttrainingquantization\]

## floating point

IEEE 754 binary interchange format stores every floating-point scalar as a triple of sign, exponent, and fraction (mantissa) bits.

> \[!definition\] Definition 1. components
>
> - `sign bit (s)`: 0 encodes positive, 1 encodes negative; contributes the factor $(-1)^s$.
> - `exponent field (e)`: $k$-bit unsigned integer with bias $B = 2^{k-1}-1$ for binary formats; controls the power-of-two scaling.
> - `fraction field (f)`: $m$-bit significand suffix. The leading 1 is implicit for normal numbers, making the significand $1.f$.

| format     | total bits | sign bits | exponent bits | fraction bits | bias $B$ |
| ---------- | ---------- | --------- | ------------- | ------------- | -------- |
| `fp32`     | 32         | 1         | 8             | 23            | 127      |
| `fp16`     | 16         | 1         | 5             | 10            | 15       |
| `bfloat16` | 16         | 1         | 8             | 7             | 127      |
| `fp8 e5m2` | 8          | 1         | 5             | 2             | 15       |
| `fp8 e4m3` | 8          | 1         | 4             | 3             | 7        |

for normal (non-zero, non-inf, non-NaN) encodings the value is

$$
x = (-1)^s \times \left(1 + \frac{f}{2^{m}}\right) \times 2^{(e - B)}.
$$

subnormal numbers have $e = 0$; the hidden leading 1 disappears and the exponent becomes $1-B$:

$$
x_{\text{sub}} = (-1)^s \times \left(\frac{f}{2^{m}}\right) \times 2^{1-B}.
$$

special patterns:

- $e = 0$ and $f = 0$ encode signed zero.
- $e = 2^k - 1$ and $f = 0$ encode $\pm \infty$.
- $e = 2^k - 1$ and $f \neq 0$ encode NaNs (quiet NaNs usually set the top bit of $f$).
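The normal, subnormal, and special-pattern rules fit in one small decoder. This is a sketch for IEEE-style layouts only. `decode_ieee` is a name introduced here, and formats that reinterpret the special patterns, such as the OCP variant of `fp8 e4m3`, are not covered. The check uses `0x40490fdb`, the `fp32` bit pattern closest to $\pi$.

```python
import math
import struct

def decode_ieee(bits: int, exp_bits: int, frac_bits: int) -> float:
    """Decode a sign | exponent | fraction bit pattern using the IEEE 754 rules."""
    sign = (bits >> (exp_bits + frac_bits)) & 1
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    f = bits & ((1 << frac_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1

    if e == (1 << exp_bits) - 1:            # all-ones exponent: inf or NaN
        value = math.inf if f == 0 else math.nan
    elif e == 0:                            # zero or subnormal: no hidden 1
        value = (f / 2 ** frac_bits) * 2.0 ** (1 - bias)
    else:                                   # normal: hidden leading 1
        value = (1 + f / 2 ** frac_bits) * 2.0 ** (e - bias)
    return -value if sign else value

pi_fp32 = decode_ieee(0x40490FDB, exp_bits=8, frac_bits=23)
reference = struct.unpack(">f", bytes.fromhex("40490fdb"))[0]
print(pi_fp32, pi_fp32 == reference)

print(decode_ieee(0x3C00, 5, 10))   # fp16 1.0
print(decode_ieee(0x0001, 5, 10))   # smallest fp16 subnormal, 2**-24
print(decode_ieee(0xFBFF, 5, 10))   # most negative finite fp16, -65504.0
```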

> \[!example\] decoding `0x40490fdb` (`fp32`)
>
> - bits: `0` | `10000000` | `10010010000111111011011`
> - $s = 0$, $e = 128$, $f = \texttt{0x490fdb} = 4\,788\,187$
> - unbiased exponent: $128 - 127 = 1$
> - significand: $1 + \frac{4\,788\,187}{2^{23}} \approx 1.57079637$
> - value: $(-1)^0 \times 1.57079637 \times 2^1 \approx 3.14159274$
>   this recovers $\pi$ to the precision of `fp32`.

round-to-nearest, ties-to-even is the default rounding mode in IEEE 754; packing a higher-precision result into `fp16` or `fp8` therefore requires first scaling the exponent (clamping on overflow), then rounding the fraction field to its shorter width. formats with more exponent bits (e.g., `bfloat16`) trade mantissa precision for a wider dynamic range, while formats with more fraction bits (e.g., `fp16`) provide finer granularity inside a smaller range.
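Because `bfloat16` keeps the `fp32` exponent, narrowing to it only requires rounding away the low 16 fraction bits. The sketch below does that with ties-to-even on the raw bit pattern and compares the result with `fp16` via `struct`. The helper names are illustrative, and NaN inputs and overflow to infinity are not handled specially.

```python
import struct

def fp32_bits(x: float) -> int:
    """Bit pattern of x after rounding to fp32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the fp32 pattern, rounding to nearest, ties to even."""
    bits = fp32_bits(x)
    lsb = (bits >> 16) & 1            # last bit that survives the cut
    rounded = bits + 0x7FFF + lsb     # a tie only rounds up when lsb is 1
    return (rounded >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen a bfloat16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def to_bf16(x: float) -> float:
    return bf16_bits_to_float(fp32_to_bf16_bits(x))

def to_fp16(x: float) -> float:
    """Round to fp16 via struct's half-precision 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_bf16(1.001), to_fp16(1.001))   # 1.0 1.0009765625
print(to_bf16(1e38))                    # finite: bfloat16 shares fp32's exponent range
# to_fp16(1e38) would raise OverflowError: fp16 tops out at 65504
```

The first line shows `fp16` resolving a difference that `bfloat16` rounds away, and the second shows the opposite side of the trade-off.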

