See also: this talk I gave at Hack the North 2023.

Quantization reduces the computational and memory costs of running inference by representing the weights and activations with low-precision data types.

Note

This also applies to post-training quantization, where the methodology is applied after the model has been trained, instead of at load time.

fp32 to fp16

Does my operation support fp16?

  • CPUs support storing fp16 weights, but computations are done in fp32

Is my operation sensitive to fp16?

For example, the epsilon in LayerNormalization is usually very small (e.g. 1e-12), but the smallest positive value representable in fp16 is about 6e-8 (about 6e-5 for normal numbers), so the epsilon underflows to zero and causes NaN issues.
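A minimal PyTorch sketch of this failure mode (the 1e-12 epsilon and the zero-variance input are illustrative assumptions, not tied to any particular model):

```python
import torch

# An epsilon like 1e-12 underflows to exactly 0 in fp16
# (smallest positive fp16 value is ~6e-8; smallest normal value is ~6.1e-5).
eps = torch.tensor(1e-12, dtype=torch.float16)
print(eps)                              # tensor(0., dtype=torch.float16)
print(torch.finfo(torch.float16).tiny)  # 6.1035e-05

# With eps flushed to zero, a LayerNorm-style denominator on a zero-variance
# input becomes 0/0, which is NaN.
zero_var = torch.tensor(0.0, dtype=torch.float16)
print(zero_var / torch.sqrt(zero_var + eps))  # tensor(nan, dtype=torch.float16)
```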

fp32 to int8

Consider a float $x$ in $[a, b]$. The affine quantization scheme is:

$$x = S \cdot (x_q - Z)$$

where:

  • $x_q$ is the quantized int8 value associated with $x$
  • $S$ and $Z$ are the scaling and zero-point parameters
    • $S$ is the scale, a positive float32
    • $Z$ is the zero-point, i.e. the int8 value corresponding to the value 0 in fp32

Thus the quantized value is:

$$x_q = \operatorname{round}(x / S + Z)$$

And any fp32 value outside of $[a, b]$ is clipped to the closest representable value:

$$x_q = \operatorname{clip}\big(\operatorname{round}(x / S + Z),\ \operatorname{round}(a / S + Z),\ \operatorname{round}(b / S + Z)\big)$$
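A minimal NumPy sketch of the scheme above, calibrated on $[a, b]$. The signed int8 range [-128, 127] and the rounding choices are assumptions; real libraries add per-channel scales, symmetric variants, and so on.

```python
import numpy as np

def quantize(x: np.ndarray, a: float, b: float):
    """Affine (asymmetric) int8 quantization calibrated on the range [a, b]."""
    qmin, qmax = -128, 127
    S = (b - a) / (qmax - qmin)            # scale: positive float32
    Z = int(round(qmin - a / S))           # zero-point: int8 value mapping to fp32 0
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, np.float32(S), Z

def dequantize(x_q: np.ndarray, S: np.float32, Z: int) -> np.ndarray:
    """Recover an approximation of x: x ≈ S * (x_q - Z)."""
    return S * (x_q.astype(np.float32) - Z)

x = np.array([-0.9, 0.0, 0.3, 1.7], dtype=np.float32)
x_q, S, Z = quantize(x, a=-1.0, b=2.0)
print(x_q, dequantize(x_q, S, Z))  # error inside [a, b] is bounded by ~S/2
```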

See also: paper

Quantization time

  • Post-training dynamic quantization: range of each activation is computed on the fly at runtime
  • Post-training static quantization: range of each activation is computed offline before runtime
    • Observers are placed on activations to record their values
    • a certain number of forward passes are run on a calibration dataset
    • the range of each activation is computed according to some calibration technique
  • Quantization aware training: range of each activation is computed during training
    • fake_quantize operations are inserted in the computation graph
    • fake_quantize is a no-op at inference time, but during training it simulates the effect of quantization (see the sketch after this list)
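A short PyTorch sketch contrasting the first and last bullets. The toy model, tensor shapes, and the scale/zero-point in the fake-quantize call are arbitrary choices for illustration; the entry points live under torch.ao.quantization in recent PyTorch releases.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Post-training dynamic quantization: Linear weights are converted to int8 once,
# activation ranges are computed on the fly at runtime.
dq_model = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(dq_model(torch.randn(1, 128)).shape)  # torch.Size([1, 10])

# QAT-style fake quantization: values are rounded/clamped as int8 would be,
# but the tensor stays fp32 so gradients can still flow during training.
w = torch.randn(4)
w_fq = torch.fake_quantize_per_tensor_affine(w, scale=0.1, zero_point=0,
                                             quant_min=-128, quant_max=127)
print(w, w_fq)
```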

Methods and libraries

bitsandbytes and GPTQ
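Both typically surface through the quantization_config argument in transformers. A hedged sketch: the opt-125m checkpoint is just an example, bitsandbytes needs a CUDA GPU, GPTQ additionally needs the optimum/auto-gptq stack, and argument names may differ across library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GPTQConfig

model_id = "facebook/opt-125m"  # example checkpoint, not a recommendation

# bitsandbytes: weights are quantized at load time (8-bit LLM.int8() here; 4-bit NF4 also exists)
bnb_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# GPTQ: one-shot post-training weight quantization driven by a calibration dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer),
    device_map="auto",
)
```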