See also: this talk I gave at Hack the North 2023.

Quantization reduces the computational and memory costs of running inference by representing the weights and activations with low-precision data types, such as:

- `int16`
- half-precision `bfloat16`
- `int8`
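
To make the memory saving concrete, a quick check of bytes per element for each dtype (sketched with PyTorch, since NumPy lacks `bfloat16`):

```python
import torch

# bytes per element for each dtype: int8 weights take 4x less memory than fp32
for dtype in (torch.float32, torch.bfloat16, torch.int16, torch.int8):
    size = torch.tensor([], dtype=dtype).element_size()
    print(f"{dtype}: {size} byte(s) per element")
```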

Note: this also applies to post-training quantization, where the methodology is applied after the model has been trained instead of at load time.

## `fp32` to `fp16`

Does my operation support `fp16`?

- CPUs support saving `fp16` weights, but computations are done in `fp32`

Is my operation sensitive to `fp16`?

For example, the `epsilon` in `LayerNormalization` is usually very small ($10^{-12}$), but the smallest positive normal value in `fp16` is $\approx 6 \times 10^{-5}$, which causes `NaN` issues.
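
A quick NumPy sketch of this underflow (the constant input is my own toy example):

```python
import numpy as np

eps = np.float16(1e-12)
print(eps)                        # 0.0 -- 1e-12 underflows to zero in fp16
print(np.finfo(np.float16).tiny)  # ~6.104e-05, smallest positive normal fp16

# LayerNorm-style normalization of a constant input: the variance is 0 and
# eps has underflowed to 0, so the division is 0/0 and yields NaN
x = np.full(4, 0.5, dtype=np.float16)
print((x - x.mean()) / np.sqrt(x.var() + eps))  # [nan nan nan nan]
```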

## `fp32` to `int8`

Consider a float `x` in `[a, b]`. The *affine quantization scheme* represents it as $x = S \cdot (x_q - Z)$,

where:

- $x_q$ is the quantized `int8` value associated with `x`
- $S$ and $Z$ are the quantization parameters:
  - $S$ is the scale, a positive `float32`
  - $Z$ is the zero-point, i.e. the `int8` value corresponding to the value `0` in `fp32`

Thus the quantized value $x_q$ is: $x_q = \mathrm{round}(x/S + Z)$

Any `fp32` value outside `[a, b]` is clipped to the closest representable value.
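
A minimal NumPy sketch of the scheme above (my own illustration, not any particular library's API):

```python
import numpy as np

def affine_quantize(x: np.ndarray, a: float, b: float):
    """Quantize fp32 values in [a, b] to int8 with the affine scheme."""
    qmin, qmax = -128, 127
    S = (b - a) / (qmax - qmin)  # scale: positive float
    Z = round(qmin - a / S)      # zero-point: the int8 value for fp32 0
    # x_q = round(x/S + Z); values outside [a, b] are clipped to the
    # closest representable int8 value
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, S, Z

def affine_dequantize(x_q: np.ndarray, S: float, Z: int) -> np.ndarray:
    """Recover an fp32 approximation: x ≈ S * (x_q - Z)."""
    return S * (x_q.astype(np.float32) - Z)

x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
x_q, S, Z = affine_quantize(x, a=-1.0, b=1.0)  # 2.0 is out of range -> clipped
print(x_q, affine_dequantize(x_q, S, Z))
```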

See also: paper

## quantization time

- Post-training **dynamic quantization**: the range of each activation is computed on the fly at *runtime* (see the sketch after this list)
- Post-training **static quantization**: the range of each activation is computed *offline*, before *runtime*
  - observers are put on activations to record their values
  - a certain number of forward passes are run on calibration datasets
  - the range of each activation is computed according to some *calibration technique*
- **Quantization-aware training**: the range of each activation is computed *during training*
  - `fake_quantize` operations are inserted in the computation graph
  - `fake_quantize` is a no-op during inference, but during training it simulates the effect of quantization
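
As referenced above, post-training dynamic quantization is a one-liner in PyTorch via `torch.ao.quantization.quantize_dynamic` (the toy model is an assumption for illustration):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# toy fp32 model
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# weights are converted to int8 ahead of time; activation ranges are
# computed on the fly at runtime ("dynamic")
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(qmodel)
```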

## Methods and libraries

bitsandbytes and GPTQ
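
For instance, bitsandbytes can be used through the Hugging Face `transformers` integration; a minimal sketch (the model id is just a placeholder):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load_in_8bit uses bitsandbytes under the hood;
# "facebook/opt-350m" is only a placeholder model id
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
```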