See also: this talk I gave at Hack the North 2023.

Quantization reduces the computational and memory costs of running inference by representing the weights and activations with low-precision data types.

Note

This also applies to post-training quantization, where the methodology is applied after the model has been trained, instead of at load time.

fp32 to fp16

Does my operation support fp16?

  • CPUs support storing fp16 weights, but computations are done in fp32

Is my operation sensitive to fp16?

For example, the epsilon in LayerNormalization is usually very small (e.g. 1e-12), but the smallest positive value representable in fp16 is about 6e-8 (about 6e-5 for normal numbers), so the epsilon underflows to zero and causes NaN issues.
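A minimal PyTorch sketch of this failure mode (the 1e-12 epsilon and the zero-variance input are illustrative assumptions, not tied to any particular model):

```python
import torch

# An epsilon like 1e-12 underflows to exactly 0 in fp16
# (smallest positive fp16 value is ~6e-8; smallest normal value is ~6.1e-5).
eps = torch.tensor(1e-12, dtype=torch.float16)
print(eps)                              # tensor(0., dtype=torch.float16)
print(torch.finfo(torch.float16).tiny)  # 6.1035e-05

# With eps flushed to zero, a LayerNorm-style denominator on a zero-variance
# input becomes 0/0, which is NaN.
zero_var = torch.tensor(0.0, dtype=torch.float16)
print(zero_var / torch.sqrt(zero_var + eps))  # tensor(nan, dtype=torch.float16)
```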

fp32 to int8

Consider a float $x$ in $[a, b]$. The affine quantization scheme is:

$$x = S \cdot (x_q - Z)$$

where:

  • $x_q$ is the quantized int8 value associated with $x$
  • $S$ and $Z$ are the scaling and zero-point parameters
    • $S$ is the scale, a positive float32
    • $Z$ is the zero-point, i.e. the int8 value corresponding to the value 0 in fp32

Thus the quantized value is:

$$x_q = \operatorname{round}(x / S + Z)$$

And any fp32 value outside of $[a, b]$ is clipped to the closest representable value:

$$x_q = \operatorname{clip}\big(\operatorname{round}(x / S + Z),\ \operatorname{round}(a / S + Z),\ \operatorname{round}(b / S + Z)\big)$$
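A minimal NumPy sketch of the scheme above, calibrated on $[a, b]$. The signed int8 range [-128, 127] and the rounding choices are assumptions; real libraries add per-channel scales, symmetric variants, and so on.

```python
import numpy as np

def quantize(x: np.ndarray, a: float, b: float):
    """Affine (asymmetric) int8 quantization calibrated on the range [a, b]."""
    qmin, qmax = -128, 127
    S = (b - a) / (qmax - qmin)            # scale: positive float32
    Z = int(round(qmin - a / S))           # zero-point: int8 value mapping to fp32 0
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, np.float32(S), Z

def dequantize(x_q: np.ndarray, S: np.float32, Z: int) -> np.ndarray:
    """Recover an approximation of x: x ≈ S * (x_q - Z)."""
    return S * (x_q.astype(np.float32) - Z)

x = np.array([-0.9, 0.0, 0.3, 1.7], dtype=np.float32)
x_q, S, Z = quantize(x, a=-1.0, b=2.0)
print(x_q, dequantize(x_q, S, Z))  # error inside [a, b] is bounded by ~S/2
```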

See also: paper

Quantization time

  • Post-training dynamic quantization: range of each activation is computed on the fly at runtime
  • Post-training static quantization: range of each activation is computed offline before runtime
    • Observers are placed on activations to record their values
    • a certain number of forward passes are run on a calibration dataset
    • the range of each activation is computed according to some calibration technique
  • Quantization aware training: range of each activation is computed during training
    • fake_quantize operations are inserted in the computation graph
    • fake_quantize is a no-op at inference time, but during training it simulates the effect of quantization (see the sketch after this list)
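A short PyTorch sketch contrasting the first and last bullets. The toy model, tensor shapes, and the scale/zero-point in the fake-quantize call are arbitrary choices for illustration; the entry points live under torch.ao.quantization in recent PyTorch releases.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Post-training dynamic quantization: Linear weights are converted to int8 once,
# activation ranges are computed on the fly at runtime.
dq_model = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(dq_model(torch.randn(1, 128)).shape)  # torch.Size([1, 10])

# QAT-style fake quantization: values are rounded/clamped as int8 would be,
# but the tensor stays fp32 so gradients can still flow during training.
w = torch.randn(4)
w_fq = torch.fake_quantize_per_tensor_affine(w, scale=0.1, zero_point=0,
                                             quant_min=-128, quant_max=127)
print(w, w_fq)
```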

Methods and libraries

bitsandbytes and GPTQ
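Both typically surface through the quantization_config argument in transformers. A hedged sketch: the opt-125m checkpoint is just an example, bitsandbytes needs a CUDA GPU, GPTQ additionally needs the optimum/auto-gptq stack, and argument names may differ across library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GPTQConfig

model_id = "facebook/opt-125m"  # example checkpoint, not a recommendation

# bitsandbytes: weights are quantized at load time (8-bit LLM.int8() here; 4-bit NF4 also exists)
bnb_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# GPTQ: one-shot post-training weight quantization driven by a calibration dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer),
    device_map="auto",
)
```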