SNPE Quantization Algorithm

 Source: developer.qualcomm

Overview

  • Non-quantized DLC files use 32-bit floating point representations of network parameters.
  • Quantized DLC files use fixed point representations of network parameters, generally 8-bit weights and 8-bit or 32-bit biases. The fixed point representation is the same as the one used in TensorFlow quantized models.

Quantization Algorithm

  • Quantization converts floating point data to the TensorFlow-style 8-bit fixed point format.
  • The following requirements are satisfied:
    • Full range of input values is covered.
    • Minimum range of 0.01 is enforced.
    • Floating point zero is exactly representable.
  • Quantization algorithm inputs:
    • Set of floating point values to be quantized.
  • Quantization algorithm outputs:
    • Set of 8-bit fixed point values.
    • Encoding parameters:
      • encoding-min - minimum floating point value representable (by fixed point value 0)
      • encoding-max - maximum floating point value representable (by fixed point value 255)
  • Algorithm
    1. Compute the true range (min, max) of input data.
    2. Compute the encoding-min and encoding-max.
    3. Quantize the input floating point values.
    4. Output:
      • fixed point values
      • encoding-min and encoding-max parameters

Details

  1. Compute the true range of the input floating point data.
    • Find the smallest and largest values in the input data.
    • These two values define the true range of the input data.
  2. Compute the encoding-min and encoding-max.
    • These parameters are used in the quantization step.
    • These parameters define the range and floating point values that will be representable by the fixed point format.
      • encoding-min: specifies the smallest floating point value that will be represented by the fixed point value of 0
      • encoding-max: specifies the largest floating point value that will be represented by the fixed point value of 255
      • floating point values at every step size, where step size = (encoding-max - encoding-min) / 255, will be representable
    1. encoding-min and encoding-max are first set to the true min and true max computed in the previous step
    2. First requirement: the encoding range must be at least 0.01
      • encoding-max is adjusted to max(true max, true min + 0.01)
    3. Second requirement: floating point value of 0 must be exactly representable
      • encoding-min or encoding-max may be further adjusted
  3. Handling 0.
    1. Case 1: Inputs are strictly positive
      • the encoding-min is set to 0.0
      • zero floating point value is exactly representable by the smallest fixed point value 0
      • e.g. input range = [5.0, 10.0]
        • encoding-min = 0.0, encoding-max = 10.0
    2. Case 2: Inputs are strictly negative
      • encoding-max is set to 0.0
      • zero floating point value is exactly representable by the largest fixed point value 255
      • e.g. input range = [-20.0, -6.0]
        • encoding-min = -20.0, encoding-max = 0.0
    3. Case 3: Inputs are both negative and positive
      • encoding-min and encoding-max are slightly shifted to make the floating point zero exactly representable
      • e.g. input range = [-5.1, 5.1]
        • encoding-min and encoding-max are first set to -5.1 and 5.1, respectively
        • encoding range is 10.2 and the step size is 10.2/255 = 0.04
        • zero value is currently not representable. The closest values representable are -0.02 and +0.02 by fixed point values 127 and 128, respectively
        • encoding-min and encoding-max are shifted by -0.02. The new encoding-min is -5.12 and the new encoding-max is 5.08
        • floating point zero is now exactly representable by the fixed point value of 128
  4. Quantize the input floating point values.
    • The encoding-min and encoding-max parameters determined in the previous steps are used to quantize all the input floating point values to their fixed point representation
    • Quantization formula is:
      • quantized value = round(255 * (floating point value - encoding-min) / (encoding-max - encoding-min))
    • The quantized value is also clamped to be within 0 and 255
  5. Outputs
    • the fixed point values
    • encoding-min and encoding-max parameters
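
The encoding computation described in steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SNPE's actual implementation: the function name compute_encoding and its parameters are hypothetical, and the text does not say how ties are broken when zero falls exactly halfway between two steps (as in the [-5.1, 5.1] case), so Python's built-in round() is used here as an assumption.

```python
def compute_encoding(values, min_range=0.01, levels=255):
    """Return (encoding_min, encoding_max) for a list of floats.

    Hypothetical helper following the steps in the text; names are illustrative.
    """
    # Step 1: true range of the input data.
    true_min = min(values)
    true_max = max(values)

    # Step 2: start from the true range and enforce the minimum range of 0.01.
    enc_min = true_min
    enc_max = max(true_max, true_min + min_range)

    # Step 3: make floating point zero exactly representable.
    if enc_min > 0.0:
        # Case 1: strictly positive inputs, zero maps to fixed point 0.
        enc_min = 0.0
    elif enc_max < 0.0:
        # Case 2: strictly negative inputs, zero maps to fixed point 255.
        enc_max = 0.0
    else:
        # Case 3: shift the range so that zero lands on a step boundary.
        step = (enc_max - enc_min) / levels
        # Assumption: tie-breaking is not specified in the text.
        zero_point = round(-enc_min / step)
        enc_min = -zero_point * step
        enc_max = enc_min + levels * step
    return enc_min, enc_max


print(compute_encoding([5.0, 10.0]))    # (0.0, 10.0)
print(compute_encoding([-20.0, -6.0]))  # (-20.0, 0.0)
print(compute_encoding([-1.8, -1.0, 0.0, 0.5]))
```

In Case 3 the width of the range is preserved; only its position moves, by less than one step, so the adjusted range may clip the original extreme values by a fraction of a step.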

Quantization Example

  • Inputs:
    • input values = [-1.8, -1.0, 0, 0.5]
  • encoding-min is set to -1.8 and encoding-max to 0.5
  • encoding range is 2.3, which is larger than the required 0.01
  • encoding-min is adjusted to −1.803922 and encoding-max to 0.496078 to make zero exactly representable
  • step size (delta or scale) is 0.009020
  • Outputs:
    • quantized values are [0, 89, 200, 255]
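
The example above can be checked by applying the quantization formula from the Details section with the adjusted encoding parameters. The sketch below is illustrative only: the function name quantize is hypothetical, and Python's built-in round() stands in for the unspecified rounding mode.

```python
def quantize(values, enc_min, enc_max, levels=255):
    """Quantize floats to integers in [0, levels] (hypothetical helper)."""
    scale = levels / (enc_max - enc_min)
    quantized = []
    for v in values:
        # round(255 * (value - encoding-min) / (encoding-max - encoding-min))
        q = round(scale * (v - enc_min))
        # Clamp to the representable fixed point range.
        quantized.append(min(levels, max(0, q)))
    return quantized


print(quantize([-1.8, -1.0, 0.0, 0.5], -1.803922, 0.496078))
# [0, 89, 200, 255]
```

Note that 0.5 lies slightly above the adjusted encoding-max of 0.496078, so it relies on the clamp (or on rounding down from about 255.4) to land on 255.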

Dequantization Example

  • Inputs:
    • quantized values = [0, 89, 200, 255]
    • encoding-min = −1.803922, encoding-max = 0.496078
  • step size is 0.009020
  • Outputs:
    • dequantized values = [−1.8039, −1.0012, 0.0000, 0.4961]
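
Dequantization is the inverse mapping: each fixed point value is multiplied by the step size and offset by encoding-min. A minimal sketch, with the hypothetical function name dequantize:

```python
def dequantize(quantized, enc_min, enc_max, levels=255):
    """Map fixed point values in [0, levels] back to floats (hypothetical helper)."""
    step = (enc_max - enc_min) / levels
    return [enc_min + q * step for q in quantized]


values = dequantize([0, 89, 200, 255], -1.803922, 0.496078)
print([round(v, 4) for v in values])
```

The recovered values differ from the original inputs [-1.8, -1.0, 0, 0.5] by at most about half a step, which is the quantization error introduced by the 8-bit representation.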
