Source: developer.qualcomm
Overview
- Non-quantized DLC files use 32-bit floating point representations of network parameters.
- Quantized DLC files use fixed point representations of network parameters, generally 8-bit weights and 8-bit or 32-bit biases. The fixed point representation is the same as that used in TensorFlow quantized models.
Quantization Algorithm
- Quantization converts floating point data to the TensorFlow-style 8-bit fixed point format.
- The following requirements are satisfied:
- Full range of input values is covered.
- A minimum encoding range of 0.01 is enforced.
- Floating point zero is exactly representable.
- Quantization algorithm inputs:
- Set of floating point values to be quantized.
- Quantization algorithm outputs:
- Set of 8-bit fixed point values.
- Encoding parameters:
- encoding-min: the minimum floating point value that is representable (by the fixed point value 0)
- encoding-max: the maximum floating point value that is representable (by the fixed point value 255)
- Algorithm
- Compute the true range (min, max) of input data.
- Compute the encoding-min and encoding-max.
- Quantize the input floating point values.
- Output:
- fixed point values
- encoding-min and encoding-max parameters
Details
- Compute the true range of the input floating point data.
- finds the smallest and largest values in the input data
- these values represent the true range of the input data
- Compute the encoding-min and encoding-max.
- These parameters are used in the quantization step.
- These parameters define the range of floating point values that will be representable in the fixed point format.
- encoding-min: specifies the smallest floating point value that will be represented by the fixed point value of 0
- encoding-max: specifies the largest floating point value that will be represented by the fixed point value of 255
- floating point values at every multiple of the step size from encoding-min, where step size = (encoding-max - encoding-min) / 255, will be representable
- encoding-min and encoding-max are first set to the true min and true max computed in the previous step
- First requirement: the encoding range must be at least 0.01
- encoding-max is adjusted to max(true max, true min + 0.01)
- Second requirement: floating point value of 0 must be exactly representable
- encoding-min or encoding-max may be further adjusted
- Handling 0.
- Case 1: Inputs are strictly positive
- the encoding-min is set to 0.0
- the floating point value zero is then exactly representable by the smallest fixed point value, 0
- e.g. input range = [5.0, 10.0]
- encoding-min = 0.0, encoding-max = 10.0
- Case 2: Inputs are strictly negative
- encoding-max is set to 0.0
- the floating point value zero is then exactly representable by the largest fixed point value, 255
- e.g. input range = [-20.0, -6.0]
- encoding-min = -20.0, encoding-max = 0.0
- Case 3: Inputs are both negative and positive
- encoding-min and encoding-max are slightly shifted to make the floating point zero exactly representable
- e.g. input range = [-5.1, 5.1]
- encoding-min and encoding-max are first set to -5.1 and 5.1, respectively
- encoding range is 10.2 and the step size is 10.2/255 = 0.04
- the zero value is not yet representable; the closest representable values are -0.02 and +0.02, corresponding to fixed point values 127 and 128, respectively
- encoding-min and encoding-max are shifted by -0.02. The new encoding-min is -5.12 and the new encoding-max is 5.08
- floating point zero is now exactly representable by the fixed point value of 128
- Quantize the input floating point values.
- the encoding-min and encoding-max parameters determined in the previous step are used to quantize all the input floating point values to their fixed point representation (a Python sketch of these steps follows this section)
- Quantization formula is:
- quantized value = round(255 * (floating point value - encoding-min) / (encoding-max - encoding-min))
- quantized value is also clamped to be within 0 and 255
- Outputs
- the fixed point values
- encoding-min and encoding-max parameters
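Putting the steps above together, the following is a minimal Python sketch of the described algorithm. It illustrates the documented behavior only and is not SNPE's actual implementation; the function names compute_encoding and quantize are invented for this example.

```python
def compute_encoding(values, min_range=0.01, num_steps=255):
    """Compute encoding-min and encoding-max for a set of float values."""
    true_min = min(values)
    true_max = max(values)

    # Start from the true range, then enforce the minimum range of 0.01.
    encoding_min = true_min
    encoding_max = max(true_max, true_min + min_range)

    # Make the floating point zero exactly representable.
    if encoding_min > 0.0:    # Case 1: strictly positive inputs
        encoding_min = 0.0
    elif encoding_max < 0.0:  # Case 2: strictly negative inputs
        encoding_max = 0.0
    else:                     # Case 3: range straddles zero
        step = (encoding_max - encoding_min) / num_steps
        # Shift both ends so that zero lands on a step boundary.
        zero_index = round(-encoding_min / step)
        shift = -encoding_min - zero_index * step
        encoding_min += shift
        encoding_max += shift
    return encoding_min, encoding_max

def quantize(values, encoding_min, encoding_max, num_steps=255):
    """Map float values to 8-bit fixed point using the encoding range."""
    step = (encoding_max - encoding_min) / num_steps
    # Round to the nearest step index, then clamp to [0, 255].
    return [max(0, min(num_steps, round((v - encoding_min) / step)))
            for v in values]
```

For instance, compute_encoding([-5.1, 5.1]) shifts the range to approximately (-5.12, 5.08), matching Case 3 above.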
Quantization Example
- Inputs:
- input values = [-1.8, -1.0, 0, 0.5]
- encoding-min is set to -1.8 and encoding-max to 0.5
- encoding range is 2.3, which is larger than the required 0.01
- encoding-min is adjusted to -1.803922 and encoding-max to 0.496078 to make zero exactly representable
- step size (delta or scale) is 0.009020
- Outputs:
- quantized values are [0, 89, 200, 255]
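The numbers above can be reproduced with a few lines of standalone Python (plain arithmetic following the algorithm in the Details section, not SNPE tooling):

```python
# Reproduce the quantization example with plain Python arithmetic.
values = [-1.8, -1.0, 0.0, 0.5]
true_min, true_max = min(values), max(values)  # -1.8, 0.5
step = (true_max - true_min) / 255             # ~0.009020
# Shift the range so that zero falls on a step boundary (Case 3).
zero_index = round(-true_min / step)           # 200
enc_min = -zero_index * step                   # ~-1.803922
enc_max = enc_min + (true_max - true_min)      # ~0.496078
quantized = [max(0, min(255, round((v - enc_min) / step))) for v in values]
print(quantized)                               # [0, 89, 200, 255]
```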
Dequantization Example
- Inputs:
- quantized values = [0, 89, 200, 255]
- encoding-min = -1.803922, encoding-max = 0.496078
- step size is 0.009020
- Outputs:
- dequantized values = [-1.8039, -1.0011, 0.0000, 0.4961]
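Dequantization is the inverse mapping: each fixed point value is multiplied by the step size and offset by encoding-min. A standalone sketch using the parameters above:

```python
# Recover approximate floating point values from the fixed point ones.
quantized = [0, 89, 200, 255]
step = 2.3 / 255       # encoding range / 255, ~0.009020
enc_min = -200 * step  # ~-1.803922; zero maps to fixed point 200
dequantized = [enc_min + q * step for q in quantized]
print([round(v, 4) for v in dequantized])
# [-1.8039, -1.0012, 0.0, 0.4961]; the -1.0012 here vs -1.0011 above
# comes from using the exact step size rather than the rounded 0.009020.
```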