# Building a quantization paradigm from first principles

**TLDR:** If you prefer example code over theory, look at
[doc/quantization_example.cc](quantization_example.cc).

## Overview

gemmlowp allows performing calculations on matrices of uint8 values, but these
matrices are only useful insofar as they somehow approximate matrices of real
numbers. By a _quantization paradigm_ we mean a correspondence between matrices
of quantized 8bit values and matrices of real numbers. The choice of a
quantization paradigm affects the calculations that gemmlowp itself needs to
perform; specifically, it affects how one goes from internal 32bit accumulators
to final 8bit outputs.

The part of gemmlowp transforming internal 32bit accumulators to final 8bit
outputs is the "output pipeline" described in [output.md](output.md).

gemmlowp's `GemmWithOutputPipeline` entry point allows specifying an arbitrary
output pipeline, letting users implement their own preferred quantized
arithmetic paradigm.

In the present document, our purpose is to show how, reasoning from first
principles and some domain-specific knowledge of neural networks, we can arrive
naturally at some specific quantization paradigm, and how that can be
implemented using a specific output pipeline.

We also aim to show how that differs from the older quantization paradigm
implemented by gemmlowp's legacy interfaces, and why the change to the newer
paradigm described in this document was useful for some applications of
gemmlowp.

## Quantization as an affine map.

In order for arithmetic on real values to map directly to arithmetic on
quantized uint8 values, the mapping between real and quantized uint8 values must
be affine, which means it must be of the form

```
real_value = A * quantized_value + B              (1)
```

for some constants A, B, or equivalently, of the form

```
real_value = C * (quantized_value + D)            (2)
```

for some constants C, D. Indeed, anything other than such an affine map would
mean that the results of the quantized calculations no longer readily provide
an approximation to the results of the real-number calculations.

## Domain-specific constraint: the real value 0 must be exactly representable.

Here a domain-specific constraint from neural networks appears: for some neural
network layers, it is very useful for optimized implementations that the real
value 0 be exactly representable.

For instance, in a Convolutional or Pooling layer with padding, it is useful to
be able to implement the padding by zero-padding the input array, so that
optimized loops do not need to become more complex to avoid overrunning the
array bounds.

In order for such zero-padding to be feasible in a quantized implementation of
such layers, it is important that the real value '0' be exactly representable in
quantized form, i.e. that it correspond exactly to some quantized value, which
we call the _zero-point_.

Indeed, if '0' were not exactly representable, then we would have to use some
quantized value for padding that does not exactly correspond to the real value
'0'. That would typically introduce inaccuracy in the result. In fact, always
using the same such value would be worse: it would introduce _bias_ in the
result.

## The final form of the quantization equation

Now let us phrase what this constraint — that the real value 0 be exactly
representable — means in each of the quantization equations (1) and (2).

In equation (1), plugging `real_value = 0` and `quantized_value = zero_point`,
we get:

```
0 = A * zero_point + B
```

equivalently:

```
zero_point = -B / A
```

We are thus left with a rather awkward constraint: the real number `-B / A` must
somehow be guaranteed to be exactly integral, so that the special uint8 value
`zero_point` can be exactly equal to it. Quite awkward!

Now let us look at equation (2). Plugging `real_value = 0` and
`quantized_value = zero_point`, we get:

```
0 = C * (zero_point + D)
```

Conveniently, since the scale `C` is nonzero, it plays no role anymore, and this
equation simplifies to:

```
0 = zero_point + D
```

In other words, `D = -zero_point`. This suggests rewriting the quantization
equation (2) into the following form (3), which will be the final form that we
will consistently use:

```
real_value = scale * (quantized_value - zero_point)    (3)
```

To go from (2) to (3), we merely renamed `C` to `scale` and `D` to
`-zero_point`.

With this quantization equation (3), the condition that 0 be exactly
representable is vacuously satisfied: `zero_point` is by definition one of the
possible `quantized_value`'s, and equation (3) maps it to a `real_value` of
exactly 0.

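For example, with `scale = 0.05` and `zero_point = 128` (illustrative numbers,
not taken from any particular network), the quantized values 128 and 140
represent the real values `0.05 * (128 - 128) = 0` and
`0.05 * (140 - 128) = 0.6`, respectively.
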
Note that the final quantization equation (3) depends on two constants, one
integral, the other an arbitrary positive real number:

*   `zero_point` is integral, more specifically is one of the possible quantized
    values (i.e. typically is a uint8 value).
*   `scale` is a positive real number. Thus at this stage we have not yet shown
    how to eliminate all usage of floating-point arithmetic. That will come
    below.

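To make equation (3) concrete, here is a minimal sketch of the resulting
quantize/dequantize correspondence. These helpers are illustrative only, not
part of the gemmlowp API:

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative helpers for equation (3); not part of the gemmlowp API.
// Dequantize: recover the approximate real value of a quantized value.
float Dequantize(std::uint8_t quantized_value, float scale,
                 std::uint8_t zero_point) {
  return scale * (static_cast<int>(quantized_value) - zero_point);
}

// Quantize: the inverse map, rounding to nearest and clamping to [0, 255].
std::uint8_t Quantize(float real_value, float scale, std::uint8_t zero_point) {
  const int q = zero_point + static_cast<int>(std::round(real_value / scale));
  return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
}
```
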
## Quantizing a matrix multiplication

Now that we know — equation (3) — how real numbers are to correspond
to quantized values (typically uint8), we turn to applying this knowledge to
rewriting a multiplication of matrices of real numbers as the equivalent
multiplication of matrices of quantized values.

Say that we have two matrices of real values `lhs_real_matrix`,
`rhs_real_matrix`. Each entry of their product is the sum (accumulation) of many
products of individual matrix entries, say `lhs_real_value * rhs_real_value`.

Now suppose that we have already quantized these two matrices according to the
above equation (3), with some already-known quantization parameters `lhs_scale`,
`rhs_scale`, `lhs_zero_point`, `rhs_zero_point`, so that their matrix entries
are quantized as

```
lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point)
rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
```

We then rewrite the matrix product accumulator accordingly:

```
result_real_value
  = Sum_over_i(lhs_real_value[i] * rhs_real_value[i])
  = Sum_over_i(
        lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) *
        rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
    )
  = lhs_scale * rhs_scale * Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )                                              (4)
```

Now our goal is to represent this result itself as a quantized matrix, i.e.
still according to equation (3), for some pre-established quantization
parameters `result_scale` and `result_zero_point`, as

```
result_real_value = result_scale *
    (result_quantized_value - result_zero_point)
```

Here we need to keep in mind that our goal is to specify what the quantized
matrix multiplication should do, i.e. how to compute `result_quantized_value`.
The last equation above is equivalent to

```
result_quantized_value = result_zero_point +
    result_real_value / result_scale
```

Now we can use equation (4) above to plug in the expression of
`result_real_value` in terms of the quantized operands, and we obtain:

```
result_quantized_value = result_zero_point +
    (lhs_scale * rhs_scale / result_scale) *
        Sum_over_i(
            (lhs_quantized_value[i] - lhs_zero_point) *
            (rhs_quantized_value[i] - rhs_zero_point)
        )                                          (5)
```

Equation (5) is the conclusion of this general discussion of how to specify what
"quantized matrix multiplication" should actually compute, in order to be able
to replace real matrix multiplications.

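As a sanity check, equation (5) can be evaluated naively for a single result
entry. The following sketch does exactly that, using floating-point for the
combined multiplier; it is a reference illustration of the math, not how
gemmlowp computes it (the floating-point multiplier is eliminated below):

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Naive reference for one result entry per equation (5). Illustration only;
// gemmlowp replaces the floating-point multiplier with fixed-point arithmetic.
std::uint8_t ReferenceQuantizedDotProduct(
    const std::vector<std::uint8_t>& lhs_quantized_value,
    const std::vector<std::uint8_t>& rhs_quantized_value,
    int lhs_zero_point, int rhs_zero_point, float lhs_scale, float rhs_scale,
    int result_zero_point, float result_scale) {
  std::int32_t accumulator = 0;
  for (std::size_t i = 0; i < lhs_quantized_value.size(); i++) {
    accumulator += (lhs_quantized_value[i] - lhs_zero_point) *
                   (rhs_quantized_value[i] - rhs_zero_point);
  }
  const float multiplier = lhs_scale * rhs_scale / result_scale;
  const int result = result_zero_point +
                     static_cast<int>(std::round(multiplier * accumulator));
  return static_cast<std::uint8_t>(std::min(255, std::max(0, result)));
}
```
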
## Implementation of quantized matrix multiplication

Having obtained the mathematical form (5) of quantized matrix multiplication, we
now turn to its actual implementation.

The innermost part of (5),

```
int32_accumulator =
    Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )
```

is the "kernel" accumulation loop. It is where the bulk of the computational
cost goes. Luckily, it only involves integers: the entries of the quantized
operand matrices, and their `zero_point` quantization parameters. Typically, all
of these values are uint8; the above differences of uint8 values would then be
represented as signed int16, and their products as signed int32.

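In scalar form (a hypothetical helper for illustration; real kernels are SIMD
and operate on packed blocks, see [kernel.md](kernel.md)), this accumulation
with the integer widths just described might look like:

```
#include <cstdint>

// Scalar sketch of the kernel accumulation. A difference of a uint8 value and
// a zero point fits in int16; the product of two such differences fits in
// int32, which is the accumulator type.
std::int32_t AccumulateOneEntry(const std::uint8_t* lhs_quantized_value,
                                const std::uint8_t* rhs_quantized_value,
                                int depth, int lhs_zero_point,
                                int rhs_zero_point) {
  std::int32_t accumulator = 0;
  for (int i = 0; i < depth; i++) {
    const std::int16_t lhs_diff = lhs_quantized_value[i] - lhs_zero_point;
    const std::int16_t rhs_diff = rhs_quantized_value[i] - rhs_zero_point;
    accumulator += static_cast<std::int32_t>(lhs_diff) * rhs_diff;
  }
  return accumulator;
}
```
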
It is outside the scope of the present doc to discuss how to avoid the overhead
of having to subtract these `zero_point` constants in this inner loop; refer to
[this section of
low-precision.md](low-precision.md#efficient-handling-of-offsets) for that. The
gist of it is that a mathematical trick allows us to take the handling of these
`zero_point` constants out of this accumulation loop, so that it simplifies to

```
int32_accumulator =
    Sum_over_i(
        lhs_quantized_value[i] *
        rhs_quantized_value[i]
    )                                              (6)
```

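For reference, the trick in question (spelled out in
[low-precision.md](low-precision.md#efficient-handling-of-offsets)) rests on
expanding the product:

```
Sum_over_i((lhs_quantized_value[i] - lhs_zero_point) *
           (rhs_quantized_value[i] - rhs_zero_point))
  = Sum_over_i(lhs_quantized_value[i] * rhs_quantized_value[i])
    - lhs_zero_point * Sum_over_i(rhs_quantized_value[i])
    - rhs_zero_point * Sum_over_i(lhs_quantized_value[i])
    + depth * lhs_zero_point * rhs_zero_point
```

where `depth` is the number of terms in the sum. The three correction terms
only involve sums of operand entries and constants, so they can be computed
outside of the accumulation loop.
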
Anyway, the result is an `int32_accumulator` that we now plug back into the
rest of (5):

```
result_quantized_value = result_zero_point +
    (lhs_scale * rhs_scale / result_scale) * int32_accumulator    (7)
```

The difficulty here is of course that `(lhs_scale * rhs_scale / result_scale)`
is a positive real number, not an integer in general. It is a constant, though.
So what we have to implement here is the (approximate) scaling of an int32 value
by some arbitrary positive constant multiplier.

Moreover, it is safe to assume that this positive constant multiplier is smaller
than one — each of the `scale` values here is typically smaller than one, as
we are typically mapping the `[0..255]` quantized uint8 value range to an
interval of real values that is much narrower than that, typically within
`[-10,10]` in most neural networks. For example, a neural network using Relu6
activation functions will typically have real activation values in the interval
`[0,6]`.

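For instance, if the lhs, rhs and result all used a Relu6-style mapping of the
real interval `[0,6]` onto `[0..255]` (an illustrative choice of scales, not
prescribed by gemmlowp), each scale would be `6/255 ≈ 0.0235`, and the combined
multiplier would be `(6/255 * 6/255) / (6/255) = 6/255 ≈ 0.0235`, comfortably
smaller than one.
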
So how do we implement the multiplication of an int32 value by a positive real
constant that is smaller than one? Typically, by multiplying by a fixed-point
constant multiplier in the normalized interval `[1/2,1)`, and right-shifting
the result to achieve the correct overall multiplier.

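Concretely, the real multiplier can be decomposed offline into such a
normalized fixed-point multiplier (stored here as a Q0.31 int32) plus a right
shift. The following is a sketch of that decomposition, assuming a multiplier
in (0, 1); gemmlowp's actual helpers live in its fixedpoint library and in
[doc/quantization_example.cc](quantization_example.cc):

```
#include <cmath>
#include <cstdint>

// Sketch: decompose real_multiplier in (0, 1) as q * 2^(-right_shift) with
// q in [1/2, 1), storing q as a Q0.31 fixed-point int32.
void DecomposeMultiplier(double real_multiplier,
                         std::int32_t* fixedpoint_multiplier,
                         int* right_shift) {
  int exponent = 0;
  // frexp returns q in [1/2, 1) with real_multiplier == q * 2^exponent;
  // since real_multiplier < 1, exponent <= 0.
  const double q = std::frexp(real_multiplier, &exponent);
  *right_shift = -exponent;
  // Round q to Q0.31.
  std::int64_t q_fixed =
      static_cast<std::int64_t>(std::round(q * (1ll << 31)));
  if (q_fixed == (1ll << 31)) {
    // Rounding pushed q up to exactly 1.0: halve it and adjust the shift.
    q_fixed /= 2;
    --*right_shift;
  }
  *fixedpoint_multiplier = static_cast<std::int32_t>(q_fixed);
}
```
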
At this point we have obtained the int32 value of the product

```
(lhs_scale * rhs_scale / result_scale) * int32_accumulator
```

Looking at (7), it only remains to add to it the integral value
`result_zero_point`, and we are done.

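Putting these last steps together, a scalar sketch of the scaling-and-offset
step might look as follows. This illustrates the idea only; gemmlowp's real
implementation uses its own fixed-point helpers and saturating arithmetic:

```
#include <algorithm>
#include <cstdint>

// Illustrative scalar sketch: scale the int32 accumulator by the decomposed
// multiplier (see DecomposeMultiplier above), add the zero point, and
// saturate to uint8.
std::uint8_t QuantizeDown(std::int32_t acc, std::int32_t fixedpoint_multiplier,
                          int right_shift, std::int32_t result_zero_point) {
  // Multiply by q in [1/2, 1): take the high 32 bits of the 64-bit product,
  // with rounding.
  const std::int64_t product =
      static_cast<std::int64_t>(acc) * fixedpoint_multiplier;
  std::int32_t scaled =
      static_cast<std::int32_t>((product + (1ll << 30)) >> 31);
  // Rounding right shift by the remaining exponent.
  if (right_shift > 0) {
    scaled = (scaled + (1 << (right_shift - 1))) >> right_shift;
  }
  const std::int32_t result = scaled + result_zero_point;
  return static_cast<std::uint8_t>(
      std::min<std::int32_t>(255, std::max<std::int32_t>(0, result)));
}
```
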
## How this is implemented in gemmlowp

The different parts of gemmlowp implementing aspects of the above discussion
are:

*   The packing stage (see [packing.md](packing.md)) implements the special
    mathematical trick to handle `lhs_offset`, `rhs_offset` that we alluded to
    above, see [this section of
    low-precision.md](low-precision.md#efficient-handling-of-offsets) for
    details. Thanks to it, the rest of the calculation can proceed as if
    `lhs_offset`, `rhs_offset` were 0.

*   The compute/kernel stage (see [kernel.md](kernel.md)) performs the core
    accumulation loop producing the `int32_accumulator`, see equation (6)
    above.

*   The unpacking stage feeds into the output pipeline (see
    [output.md](output.md)), which implements the rest of the evaluation of the
    above equation (5), which we discussed in the previous section.

Now, the point of gemmlowp's flexible output-pipelines mechanism (see
[output.md](output.md)) is to support different quantization paradigms, so we
now have to specify which particular flavor of output pipeline corresponds to
the particular quantization paradigm that we detailed above in this document.

The specific output pipeline stage implementing the present quantization
paradigm, i.e. implementing the precise computation detailed in the previous
section (equation (5)), is `OutputStageQuantizeDownInt32ByFixedPoint`.

Please refer to the comment explaining it in
[public/output_stages.h](../public/output_stages.h).

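For orientation, assembling and running such a pipeline might look roughly like
the following sketch, in the spirit of
[doc/quantization_example.cc](quantization_example.cc). Stage field names and
the entry-point signature should be checked against
[public/output_stages.h](../public/output_stages.h) and that example before
relying on them:

```
// Sketch only; verify names against public/output_stages.h and
// doc/quantization_example.cc. Assumes in scope: lhs, rhs (input MatrixMaps),
// result (output MatrixMap), the zero points, and a multiplier decomposed
// into fixedpoint_multiplier / right_shift as discussed above. Note that the
// offsets passed to gemmlowp are the negated zero points.
gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage;
quantize_down_stage.result_fixedpoint_multiplier = fixedpoint_multiplier;
quantize_down_stage.result_shift = right_shift;
quantize_down_stage.result_offset_after_shift = result_zero_point;
gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
const auto output_pipeline =
    std::make_tuple(quantize_down_stage, saturating_cast_stage);

gemmlowp::GemmContext gemm_context;
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, lhs, rhs, &result, -lhs_zero_point, -rhs_zero_point,
    output_pipeline);
```
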
## How this differs from the older legacy gemmlowp quantization paradigm

The difference between the older legacy quantization paradigm described in
[low-precision.md](low-precision.md) and the newer one described in this
document boils down to the difference between the legacy output stage
implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output
stage implementing the new paradigm,
`OutputStageQuantizeDownInt32ByFixedPoint`.

Please refer to the comments in
[public/output_stages.h](../public/output_stages.h) for details about these two
output stages and how they differ.

Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:

1.  The int32 accumulators (inputs to the output stage) undergo a plain int32
    multiplication with an int32 multiplier, which may overflow. By contrast, in
    the newer `OutputStageQuantizeDownInt32ByFixedPoint`, this integer
    multiplication becomes a fixed-point multiplication and cannot overflow.

    *   In practice, to limit the risk of overflow, this pushes users to choose
        smaller values for this integer multiplier, which means limited
        multiplicative accuracy, which may cause multiplicative bias depending
        on how it is used.

2.  Note how the order of multiplying by the multiplier and adding the
    `result_offset` is swapped. This reflects a quantization equation of the
    form (1) above, as opposed to the form (2)/(3) that the new quantization
    paradigm uses. As a result, it is essentially impossible to guarantee that 0
    is an exactly-representable value, which as discussed above is an issue at
    least in some convolutional neural network applications.

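Schematically, and glossing over rounding details (see the comments in
[public/output_stages.h](../public/output_stages.h) for the exact behavior),
the old stage computes something like

```
result = ((int32_accumulator + result_offset) * result_mult_int) >> result_shift
```

with the offset added before the multiplication, as in form (1), while the new
stage computes something like

```
result = RoundingRightShift(
             FixedPointMultiply(int32_accumulator,
                                result_fixedpoint_multiplier),
             result_shift) + result_offset_after_shift
```

with the offset added after, as in form (3). Here `FixedPointMultiply` and
`RoundingRightShift` are schematic names for illustration, not actual gemmlowp
functions.
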
## Example code illustrating the new quantization paradigm

Example code showing how to perform a quantized matrix multiplication in the
quantization paradigm discussed here is in
[doc/quantization_example.cc](quantization_example.cc).