Quantize a Model
Inference Engine imports model constants and weights as 32-bit values. To reduce the model's size on disk and in memory, use model quantization.
Quantization represents the weight values in a lower-precision format. At runtime, Inference Engine converts these values back to a higher-precision format before processing the operations.
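For intuition, here is a minimal conceptual sketch of that round trip. It uses Unity's built-in half-precision helpers rather than anything specific to Inference Engine, and only illustrates the idea of storing a value in fewer bits and expanding it back before use.

```csharp
using UnityEngine;

public static class Float16RoundTripSketch
{
    // Conceptual only: a 32-bit weight is stored as 16 bits,
    // then expanded back to 32 bits before it is used in computation.
    public static float RoundTrip(float weight)
    {
        ushort stored = Mathf.FloatToHalf(weight); // what ends up on disk / in memory
        return Mathf.HalfToFloat(stored);          // what the runtime computes with
    }
}
```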
Quantization types
Inference Engine currently supports the following quantization types.
Quantization type | Bits per value | Description |
---|---|---|
None | 32-bit | Stores the value in full precision. |
Float16 | 16-bit | Converts the values to 16-bit floating-point format. Often preserves accuracy close to the original model. |
Uint8 | 8-bit | Linearly quantizes values between the lowest and highest weight values. Might significantly affect accuracy, depending on the model. |
A lower bit count per value decreases your model’s disk and memory usage without significantly affecting inference speed.
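For the Uint8 type, each weight is mapped linearly onto one of 256 levels between the lowest and highest weight values. The following is a rough conceptual sketch of that mapping, not the engine's implementation:

```csharp
using System;
using System.Linq;

public static class Uint8QuantizationSketch
{
    // Conceptual only: quantizes weights to bytes on a linear scale between
    // the minimum and maximum values, then reconstructs approximate floats.
    public static float[] QuantizeDequantize(float[] weights)
    {
        float min = weights.Min();
        float max = weights.Max();
        float scale = (max - min) / 255f;
        if (scale == 0f)
            return (float[])weights.Clone(); // all weights identical, nothing to scale

        // Quantize: one byte per weight.
        byte[] stored = weights.Select(w => (byte)Math.Round((w - min) / scale)).ToArray();

        // Dequantize: recover an approximation of the original 32-bit values.
        return stored.Select(b => min + b * scale).ToArray();
    }
}
```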
Note
Inference Engine only quantizes float weights used as inputs to specific operations, such as Dense, MatMul, or Conv. Integer constants remain unchanged.
The impact of quantization on model accuracy varies depending on the model type. The best way to evaluate quantization is to test the quantized model and compare its performance and accuracy against the original model. A comparison sketch follows the code example below.
Quantizing a loaded model
To quantize a model in code, follow these steps:
- Use the `ModelQuantizer` API to apply quantization to the model.
- Use the `ModelWriter` API to serialize and save the quantized model to disk.
```csharp
using Unity.InferenceEngine;

void QuantizeAndSerializeModel(Model model, string path)
{
    // Inference Engine destructively edits the source model in memory when quantizing.
    ModelQuantizer.QuantizeWeights(QuantizationType.Float16, ref model);

    // Serialize the quantized model to a file.
    ModelWriter.Save(path, model);
}
```
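As noted earlier, the most reliable way to judge the accuracy impact is to compare outputs from the original and quantized models on the same input. The sketch below is one way to do that; it assumes the `Worker`, `Tensor<float>`, and `ModelLoader` APIs from recent Inference Engine versions, plus a placeholder input shape and a hypothetical `quantizedPath` field, none of which come from this page. Adapt it to your model and package version.

```csharp
using UnityEngine;
using Unity.InferenceEngine;

public class QuantizationAccuracyCheck : MonoBehaviour
{
    // Assumptions: 'originalAsset' is the unquantized model, and 'quantizedPath'
    // points to the file written by ModelWriter.Save. The (1, 4) input shape and
    // values are placeholders; replace them with real input data for your model.
    public ModelAsset originalAsset;
    public string quantizedPath;

    void Start()
    {
        Model original = ModelLoader.Load(originalAsset);
        Model quantized = ModelLoader.Load(quantizedPath);

        using var input = new Tensor<float>(new TensorShape(1, 4), new float[] { 1f, 2f, 3f, 4f });
        using var workerA = new Worker(original, BackendType.CPU);
        using var workerB = new Worker(quantized, BackendType.CPU);

        workerA.Schedule(input);
        workerB.Schedule(input);

        // Copy both outputs back to the CPU so they can be compared directly.
        using var outA = (workerA.PeekOutput() as Tensor<float>).ReadbackAndClone();
        using var outB = (workerB.PeekOutput() as Tensor<float>).ReadbackAndClone();

        float[] a = outA.DownloadToArray();
        float[] b = outB.DownloadToArray();

        // Report the largest absolute difference introduced by quantization.
        float maxDiff = 0f;
        for (int i = 0; i < a.Length; i++)
            maxDiff = Mathf.Max(maxDiff, Mathf.Abs(a[i] - b[i]));

        Debug.Log($"Max output difference after quantization: {maxDiff}");
    }
}
```

If the difference is acceptable for your use case, keep the quantized model; otherwise try a higher-precision quantization type such as Float16, or keep the weights at full precision.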